Nečas Center Series

Jennifer Scott, Miroslav Tůma

Algorithms for Sparse Linear Systems

# **Nečas Center Series**

#### **Editors-in-Chief**

Josef Málek, Charles University, Prague, Czech Republic
Endre Süli, University of Oxford, Oxford, UK

#### **Managing Editor**

Beata Kubis, Czech Academy of Sciences, Prague, Czech Republic

#### **Editorial Board Members**

Peter Bastian, University of Heidelberg, Heidelberg, Germany
Miroslav Bulíček, Charles University, Prague, Czech Republic
Andrea Cianchi, University of Florence, Florence, Italy
Camillo De Lellis, University of Zurich, Zurich, Switzerland
Eduard Feireisl, Czech Academy of Sciences, Prague, Czech Republic
Volker Mehrmann, Technical University of Berlin, Berlin, Germany
Luboš Pick, Charles University, Prague, Czech Republic
Milan Pokorný, Charles University, Prague, Czech Republic
Vít Průša, Charles University, Prague, Czech Republic
K R Rajagopal, Texas A&M University, College Station, TX, USA
Christophe Sotin, California Institute of Technology, Pasadena, CA, USA
Zdeněk Strakoš, Charles University, Prague, Czech Republic
Vladimír Šverák, University of Minnesota, Minneapolis, MN, USA
Jan Vybíral, Czech Technical University, Prague, Czech Republic

The Nečas Center Series aims to publish high-quality monographs, textbooks, lecture notes, habilitation and Ph.D. theses in the field of mathematics and related areas in the natural and social sciences and engineering. There is no restriction regarding the topic, although we expect that the main fields will include continuum thermodynamics, solid and fluid mechanics, mixture theory, partial differential equations, numerical mathematics, matrix computations, scientific computing and applications. Emphasis will be placed on viewpoints that bridge disciplines and on connections between apparently different fields. Potential contributors to the series are encouraged to contact the editor-in-chief and the manager of the series.

All manuscripts are peer-reviewed to meet the highest standards of scientific literature. Interested authors may submit proposals by email to the series editors or to the relevant Birkhäuser editor listed under "Contacts."

Jennifer Scott • Miroslav Tůma

# Algorithms for Sparse Linear Systems

Jennifer Scott
Department of Mathematics and Statistics, University of Reading, Reading, UK
Computational Mathematics Group, STFC Rutherford Appleton Laboratory, Harwell, UK

Miroslav Tůma
Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic

ISSN 2523-3343 ISSN 2523-3351 (electronic)
Nečas Center Series
ISBN 978-3-031-25819-0 ISBN 978-3-031-25820-6 (eBook)
https://doi.org/10.1007/978-3-031-25820-6

This work was supported by University of Reading

© The Editor(s) (if applicable) and The Author(s) 2023. This book is an open access publication.

**Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This book is published under the imprint Birkhäuser, www.birkhauser-science.com by the registered company Springer Nature Switzerland AG

The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

## **Preface**

The solution of linear systems of equations *Ax* = *b* is a cornerstone of computational science and engineering. Being able to solve linear systems in a reliable and efficient way is of great importance and interest not only to scientists and engineers but also to a huge and varied community of people who are unaware that at the heart of the software they are using lies a linear equation solver and that this is key to its feasibility and performance. In many applications, the linear systems that must be solved are large and square and they are **sparse** (that is, many of the entries in the system matrix *A* are zero). **Direct methods** for solving such systems are characterized by computing a **factorization** (or **decomposition**) of *A* into a product of much simpler matrices in such a way that solving systems of equations with these matrices is easy and inexpensive. For example, *A* may be factorized into a product of triangular matrices; in principle, solving a linear system in which the system matrix is triangular is straightforward. Direct methods obtain the solution to the linear system in a finite and fixed number of steps that is independent of *A* and *b*. Because of rounding errors, the computed solution is generally not equal to the exact one but, if a direct method is well implemented, the resulting software is extremely robust and can be used as a "black box solver", with the user not needing any detailed knowledge or understanding of what is going on within the box.

By contrast, an **iterative method** (sometimes also called an indirect method) generally involves an unknown number of steps and its performance is highly problem dependent. In many cases, for the method to converge to the sought-after solution of the linear system, it is necessary to use a **preconditioner**. This has to be tailored to the system being solved. The aim is to transform the linear system into one with more favourable numerical properties so that, when applied to the transformed system, the iterative solver converges to a solution of the requested accuracy in an acceptable number of steps. The major advantage of iterative solvers over direct ones is that they require very little memory and, once the preconditioner has been constructed, most of the computational work is in the application of the preconditioner and matrix-vector products with *A*. For extremely large problems (for example, systems coming from discretizations of real-world three- or four-dimensional problems), memory requirements prohibit the use of direct methods, and without suitable iterative methods the systems would be intractable.

This book presents classical techniques for matrix factorizations based on variants of Gaussian elimination that are used in sparse direct methods and discusses the construction of approximate direct and inverse factorizations that are key to developing algebraic preconditioners for use with iterative solvers. While a number of books on iterative solvers discuss the construction of simple incomplete matrix factorizations for use as preconditioners, very few attempt to unite the fields of complete and incomplete factorizations or cover contemporary approaches. To achieve this broad view, we use a single framework that emphasizes the underlying sparsity structures and highlights the importance of understanding sparse direct techniques when building algebraic preconditioners.

The book is algorithmically oriented, presenting computational schemes that are designed to provide both an understanding of sophisticated sparse factorization techniques and how they can be implemented in practice. Throughout, we include outline algorithmic descriptions and use pseudocode that is independent of any programming language. However, limitations on space mean that it is beyond the scope of the book to discuss the complex implementation details that are needed in the development of high-quality sophisticated (parallel) production software for efficiently solving sparse linear systems using modern computer architectures.

The book is aimed at students of applied mathematics and scientific computing as well as at computational scientists and software developers interested in understanding the theory and algorithms needed to tackle the challenge of solving large-scale linear systems. The presented treatment is intended to be largely self-contained, and we assume only that the reader has a basic knowledge of linear algebra and numerical mathematics.

The organization of the book is as follows. Chapter 1 provides a general introduction to sparse matrices and the challenges of solving large sparse linear systems of equations. Concepts from graph theory that are used in the development of sparse matrix algorithms are recalled in Chapter 2. The material in Chapters 1 and 2 is rather elementary, but it serves to remind the reader of important ideas and to introduce the notation and terminology that is used throughout the rest of the book. An introduction to sparse matrix factorizations, including the use of block forms, is given in Chapter 3. Then, in Chapters 4 and 5, the symbolic and numerical factorization phases of sparse Cholesky methods for solving the important class of symmetric positive definite linear systems are discussed. Sparse LU factorizations for general nonsymmetric sparse systems are described in Chapter 6. Chapter 7 is devoted to stability and pivoting strategies and includes a discussion of factorizing sparse symmetric indefinite systems. Sparse matrix ordering algorithms that are essential for the efficiency of sparse solvers are presented in Chapter 8.

The final three chapters of the book switch attention from direct methods to the study of algebraic preconditioners for use with iterative solvers. The emphasis is on employing and adapting ideas and concepts used by direct solvers in the development of effective general classes of preconditioners that can be used for tackling a wide range of problems, without relying on detailed knowledge of the properties of the underlying application. Chapter 9 introduces algebraic preconditioners and approximate factorizations. Chapters 10 and 11 then focus on two key classes of algebraic preconditioners: incomplete factorizations and sparse approximate inverse preconditioners.

We do not attempt to cite all the vast array of publications related to sparse direct methods and algebraic preconditioners. Furthermore, we do not include proofs for all the theoretical results that we present. Rather, for each theorem, we provide one or more citations to where the reader can find a proof and/or get a better understanding of the result. In general, we include citations to the original paper/book/report (or a textbook for standard results) and, in some cases, an additional citation that is either more accessible or presents an alternative proof. In addition, at the end of each chapter, we have a short section of notes with references to key publications that give a historical perspective and/or provide further reading. It is interesting to note that a Google Scholar search in July 2022 for the term "sparse matrix" lists more than 2.7 million results, while a search for "sparse matrix decompositions" gives in excess of a million results. Although the majority may not be relevant to our areas of interest, it does indicate the wealth of the available literature as well as the importance of sparse matrix algorithms and their widespread use.

This monograph and its study of sparse linear systems represents a natural extension of our successful long-term research collaboration, combined with the research and the software development projects that we have each worked on with other researchers. Past and present colleagues at the Rutherford Appleton Laboratory that Jennifer would particularly like to acknowledge and thank for many years of collaborations and enjoyable coffee time chats are Iain Duff, Nick Gould, Jonathan Hogg, Yifan Hu, Tyrone Rees, and John Reid. Miroslav would like to express his thanks to his first major collaborator Michele Benzi, from whom he learnt a lot, to Ivan Němec, who invited him to work on codes that are now in the RFEM Structural Analysis and Engineering Software, and to his colleagues and friends in Prague, especially Zdeněk Strakoš, Miro Rozložník, Josef Málek, Petr Tichý, and Iveta Hnětynková, who created a kind and productive working environment.

We are very grateful to Hussam Al Daas, Jonathan Hogg, and Gerard Meurant for reading and commenting on all or part of a draft of the book. They spotted errors and made suggestions that led to important improvements; we really appreciate the time they spent doing this for us. We would also like to thank our institutions for opportunities to spend time in Prague, the Rutherford Appleton Laboratory and Reading working on our joint research projects. Jennifer would like to acknowledge funding over the last 30 years from the Science and Technology Facilities Council and the Engineering and Physical Sciences Research Council. And we are extremely grateful to the University of Reading for providing the funding that allows this book to be published as open access.

And, finally, we each owe a huge debt of gratitude to our families. Jennifer wishes to dedicate the book to her close family, both those who are no longer with us and those who continue to be an important part of her life, and most especially Stewart, Emma, Simon, Mark, and Rebecca for their constant encouragement. Miroslav would like to dedicate the book to the memory of his ever-supportive parents and to thank Anna, Markéta and Martin, who have always tolerated his passion for research.

Jennifer Scott, Harwell and Reading, UK
Miroslav Tůma, Prague, Czech Republic
August 2022

## **Notation: Quick Reference Summary**

#### **Notational Conventions Used for Matrices and Vectors**


Different forms of double subscripted upper case italic letters:



### **Notational Conventions Used When Discussing Graphs**


The following are for an undirected graph G:


The following are for a digraph (directed graph) G:


$i \overset{\mathcal{G}}{\underset{\min}{\Longrightarrow}} j$ or $i \underset{\min}{\Longrightarrow} j$: all intermediate vertices on the path are less than $\min\{i, j\}$ (fill-path).

$i \overset{\mathcal{G}}{\underset{\mathcal{V}_s}{\Longrightarrow}} j$ or $i \underset{\mathcal{V}_s}{\Longrightarrow} j$: all intermediate vertices on the path belong to $\mathcal{V}_s$.

### **Specific Variables and Matrices That Are Used Throughout**


# **Abbreviations**


## **Chapter 1 An Introduction to Sparse Matrices**

*Let us begin with a few words about the subject itself. What are all these research workers trying to do? Mostly, they are trying to solve Ax* = *b... Amazing. Can people still find something new to say on these corny old subjects? The answer is yes . . . It is the pressure to solve bigger and more complex problems that has led people to return again and again to look in ever-increasing detail at such basic tools as a linear equations solver – Parlett (1974).*

*We may therefore interpret the elimination method as . . . the combination of two tricks: First, it decomposes A into a product of two [triangular] matrices . . . [and second] it forms their inverses by a simple, explicit, inductive process – Von Neumann & Goldstine (1947)*

### **1.1 Motivation**

Consider the simple matrix *A* on the left in Figure 1.1. Many of its entries are zero (and so are omitted). This is an example of a **sparse** matrix. The problem we are interested in is that of solving linear systems of equations *Ax* = *b*, where the square sparse matrix *A* and the vector *b* are given and the solution vector *x* is required. Such systems arise in a huge range of practical applications, including in areas as diverse as quantum chemistry, computer graphics, computational fluid dynamics, power networks, machine learning, and optimization. The list is endless and constantly growing, together with the sizes of the systems. For efficiency and to enable large systems to be solved, the sparsity of *A* must be exploited and operations with the zero entries avoided. To achieve this, sophisticated algorithms are required.

**Figure 1.1** The locations of the nonzero entries in a sparse matrix from structural engineering (left) and in $L + L^T$ (right), where *L* is its Cholesky factor.

**Figure 1.2** The locations of the nonzero entries in a symmetric permutation of the matrix from Figure 1.1 (left) and in $\bar{L} + \bar{L}^T$ (right), where $\bar{L}$ is the Cholesky factor of the permuted matrix.

The majority of algorithms fall into two main categories: direct methods and iterative methods. **Direct methods** transform *A* using a finite sequence of elementary transformations into a product of simpler sparse matrices in such a way that solving linear systems of equations with these factor matrices is comparatively easy and inexpensive. For example, if *A* is symmetric positive definite, consider the Cholesky factorization $A = LL^T$, where the factor *L* is a lower triangular matrix (and $L^T$ denotes the transpose of *L*). Solving linear systems with a triangular matrix is generally cheaper and more straightforward than for a general matrix. For the matrix in Figure 1.1, it is clear that *L* has filled in, that is, compared to *A*, it has more nonzero entries. If the amount of fill-in is too high, then the advantages of having a triangular matrix will be lost. An important question is: can we permute the rows and columns of *A* so as to reduce the fill-in in its factor *L*? One possibility is shown in Figure 1.2. Here *A* has been symmetrically permuted to give a matrix that has a much sparser factorization $\bar{L}\bar{L}^T$.

Having fewer entries in $\bar{L}$ reduces both the required storage and the number of operations that are needed to compute it and that must be performed when using it. This simple example suggests other possible questions, such as: how can the positions of the nonzero entries in *A* and in its factors be described? How can the sparsity pattern of the factors be determined from that of *A*? What influences the computational efficiency of matrix factorizations and other matrix transformations on contemporary computers?
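The effect of a symmetric permutation on fill-in can be seen on a tiny example. The sketch below is an illustration only and is not taken from the matrices in Figures 1.1 and 1.2: it assumes a small dense-storage "arrow" matrix (diagonal plus one full row and column), uses NumPy's dense Cholesky routine, and counts entries above a small tolerance as nonzeros. The names `A_first`, `A_last`, and `nnz_lower` are hypothetical.

```python
import numpy as np

def nnz_lower(M, tol=1e-12):
    """Count entries in the lower triangle of M whose magnitude exceeds tol."""
    return int(np.count_nonzero(np.abs(np.tril(M)) > tol))

n = 6
# SPD arrow matrix with the full row/column ordered FIRST
# (strictly diagonally dominant with positive diagonal, hence SPD).
A_first = n * np.eye(n)
A_first[0, 1:] = 1.0
A_first[1:, 0] = 1.0

# Symmetric permutation (reverse ordering) moves the full row/column LAST.
perm = np.arange(n)[::-1]
A_last = A_first[np.ix_(perm, perm)]

L_first = np.linalg.cholesky(A_first)
L_last = np.linalg.cholesky(A_last)

print(nnz_lower(A_first), nnz_lower(L_first))  # 11 21: the factor fills in completely
print(nnz_lower(A_last), nnz_lower(L_last))    # 11 11: no fill-in at all
```

Both matrices hold the same numerical entries; only the ordering differs. With the full row and column eliminated first, every later row is coupled to every other one, whereas eliminating it last creates no new entries.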

Direct methods built on matrix factorizations are designed to be robust so that, properly implemented, they can be confidently used as black-box solvers for computing solutions with predictable accuracy. However, they can be expensive, requiring large amounts of memory, which increases with the size of *A*. By contrast, **iterative methods** compute a sequence of approximations

$$x^{(0)}, x^{(1)}, x^{(2)}, \dots$$

that (hopefully) converge to the solution *x* of the linear system in an acceptable number of iterations. The number of iterations depends on the initial guess $x^{(0)}$, *A* and *b* as well as the accuracy that is wanted in *x*. Iterative methods use the matrix *A* only indirectly, through matrix–vector products, and their memory requirements are limited to a (small) number of vectors of length the order of *A*, making them attractive for very large problems and problems where *A* is not available explicitly. They can be terminated as soon as the required accuracy in the computed solution is achieved. Unfortunately, frequently convergence does not happen or the number of iterations is unacceptably large; in such cases, preconditioning is needed. The aim of preconditioning is to speed up convergence by transforming the given linear system into an equivalent system (or one from which it is easy to recover the solution of the original system) that has nicer numerical properties. For example, the transformed system could be

$$M^{-1}Ax = M^{-1}b,$$

where the matrix *M* is the **preconditioner** and $M^{-1}$ denotes its inverse. Knowledge of the underlying problem, such as whether or not it arises from a partial differential equation, can help in the construction of an effective preconditioner. Otherwise, purely algebraic approaches that simply take the entries of *A* as input may be used. The class of **algebraic preconditioners** includes those based on incomplete (or approximate) factorizations of *A*. In this case, possible questions include: can some of the factor entries be discarded to obtain a sparser but approximate factor that is useful as a preconditioner? If so, which entries can be discarded? What are the implications of this on the associated computational costs?
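As a rough illustration of what "nicer numerical properties" can mean, the following sketch compares the 2-norm condition number of a badly row-scaled matrix with that of the transformed matrix $M^{-1}A$. It rests on several assumptions that are not part of the text: the test matrix is artificial, the preconditioner is the simplest possible choice $M = \mathrm{diag}(A)$ (often called a Jacobi preconditioner), and the condition number is used only as a convenient proxy for how hard a system is for an iterative method.

```python
import numpy as np

n = 50
# Well-conditioned tridiagonal matrix T (its eigenvalues lie between 2 and 6).
T = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

# Badly scaled test matrix A = D T, where the row scalings span three
# orders of magnitude.
d = np.logspace(0, 3, n)
A = d[:, None] * T
b = A @ np.ones(n)          # right-hand side chosen so that x = (1, ..., 1)

# Diagonal preconditioner M = diag(A); applying M^{-1} is just a row scaling.
m = np.diag(A).copy()
MinvA = A / m[:, None]
Minvb = b / m

cond_A = np.linalg.cond(A)
cond_MinvA = np.linalg.cond(MinvA)

# The transformed system has the same solution as the original one.
x = np.linalg.solve(MinvA, Minvb)

print(f"cond(A)      = {cond_A:.1e}")
print(f"cond(M^-1 A) = {cond_MinvA:.1e}")
print("same solution:", np.allclose(x, np.ones(n)))
```

In practice $M^{-1}A$ is never formed explicitly; an iterative solver only needs to apply $M^{-1}$ to a vector. The explicit product is formed here solely so that the two condition numbers can be compared.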

This book uses a unified framework to address such questions for direct methods and algebraic preconditioners, examining both the theoretical and algorithmic aspects of solving large-scale linear systems of equations.

#### **1.2 Introductory Terminology and Concepts**

Our interest is in solving linear systems of equations

$$Ax = b,\tag{1.1}$$

where the matrix $A \in \mathbb{R}^{n \times n}$ is **nonsingular** and **sparse**, the right-hand side vector $b \in \mathbb{R}^n$ is given (it may be sparse or dense), and $x \in \mathbb{R}^n$ is the required solution vector. Here *n* is the **order** (or dimension) of *A* and the **length** of *x* and *b*. Although we focus on real *A*, many of the results and algorithms we present are valid for complex *A*.

Entries of *A* are referred to using the notation

$$A = (a_{ij}), \quad 1 \le i, j \le n.$$

An entry whose value is not zero (or is treated as not being equal to zero) is called a **nonzero**. Column *j* of *A* is denoted by $A_{1:n,j}$ (or $A_{:,j}$) and row *i* by $A_{i,1:n}$ (or $A_{i,:}$). $A_{i:j,k:l}$ denotes the $(j - i + 1) \times (l - k + 1)$ submatrix of *A* comprising rows *i* to *j* and columns *k* to *l*. *A* is **diagonal** if $a_{ij} = 0$ for all $i \neq j$; it is **lower triangular** if $a_{ij} = 0$ for all $i < j$; it is **upper triangular** if $a_{ij} = 0$ for all $i > j$. *A* is **unit triangular** if it is triangular and all the entries on the diagonal are equal to unity.

The matrix *A* is **structurally symmetric** if for all *i* and *j* for which $a_{ij}$ is nonzero the entry $a_{ji}$ is also nonzero. *A* is **symmetric** if

$$a_{ij} = a_{ji}, \text{ for all } i, j.$$

Otherwise, *A* is **nonsymmetric**. The **symmetry index** *s(A)* of *A* is defined to be the number of nonzeros $a_{ij}$, $i \neq j$, for which $a_{ji}$ is also nonzero divided by the total number of off-diagonal nonzeros. Small values of *s(A)* indicate the matrix is far from symmetric, while values close to unity indicate an almost symmetric pattern. *A* is **symmetric positive definite** (SPD) if it is symmetric and satisfies
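The symmetry index can be computed directly from the definition. The following is a minimal sketch for a small matrix held as a dense NumPy array; the function name `symmetry_index` is hypothetical, and the value returned for a matrix with no off-diagonal nonzeros (where the definition would divide by zero) is an assumption made here for convenience.

```python
import numpy as np

def symmetry_index(A):
    """s(A): fraction of off-diagonal nonzeros a_ij for which a_ji is also nonzero."""
    P = np.asarray(A) != 0           # boolean sparsity pattern (a new array)
    np.fill_diagonal(P, False)       # only off-diagonal entries are counted
    total = int(P.sum())
    if total == 0:
        return 1.0                   # assumption: no off-diagonal nonzeros counts as symmetric
    matched = int((P & P.T).sum())   # entries whose mirror image is also nonzero
    return matched / total

A = np.array([[4.0, 1.0, 0.0],
              [2.0, 5.0, 3.0],
              [0.0, 0.0, 6.0]])

# Off-diagonal nonzeros: (0,1), (1,0), (1,2); the first two are mirrored.
print(symmetry_index(A))         # 2/3
print(symmetry_index(A + A.T))   # 1.0: the pattern of A + A^T is symmetric
```

Note that the index depends only on the positions of the nonzeros, not on their values: the matrix above has $a_{01} \neq a_{10}$, yet both entries count as matched.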

$$v^T A v > 0 \text{ for all nonzero } v \in \mathbb{R}^n.$$

Otherwise, *A* is **symmetric indefinite**. An important class of symmetric indefinite matrices are **saddle point matrices** of the form


$$A = \begin{pmatrix} G & R^T \\ R & -B \end{pmatrix},$$

where $G \in \mathbb{R}^{n_1 \times n_1}$, $B \in \mathbb{R}^{n_2 \times n_2}$, $R \in \mathbb{R}^{n_2 \times n_1}$ with $n_1 + n_2 = n$, *G* is an SPD matrix, and *B* is a symmetric positive semidefinite matrix (that is, $v^T B v \ge 0$ for all nonzero $v \in \mathbb{R}^{n_2}$). In some applications, $B = 0$.

As we will see later, it can be useful to partition the general matrix *A* into blocks. We formally express the partitioning as

$$A = (A_{ib,jb}), \ A_{ib,jb} \in \mathbb{R}^{n_i \times n_j}, \ 1 \le ib, jb \le nb,\tag{1.2}$$

that is,

$$A = \begin{pmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,nb} \\ A_{2,1} & A_{2,2} & \cdots & A_{2,nb} \\ \vdots & \vdots & \ddots & \vdots \\ A_{nb,1} & A_{nb,2} & \cdots & A_{nb,nb} \end{pmatrix}.$$

We assume the square blocks $A_{jb,jb}$ on the diagonal are nonsingular. We say that *A* is **block diagonal** if $A_{ib,jb} = 0$ for all $ib \neq jb$. *A* is **block lower triangular** if $A_{1:jb-1,jb} = 0$, $2 \le jb \le nb$, and it is **block upper triangular** if $A_{jb+1:nb,jb} = 0$, $1 \le jb \le nb - 1$.

Direct methods factorize the sparse matrix *A* into a product of other sparse matrices; what is an appropriate factorization depends on the properties of *A*. In this book, the focus is on the following variants of Gaussian elimination. 


As already observed, *A* is sparse if many of its entries are zero. Frequently, large matrices that arise in practical problems are sparse, and when solving large-scale linear systems, taking advantage of the sparsity is essential; indeed, many problems are intractable unless advantage is taken of sparsity to reduce the computational costs in terms of storage and the number of operations that must be performed. What proportion of the entries needs to be zero for the matrix to be considered as sparse is not fixed and can depend on the pattern of the entries, the operations to be performed, and the computer architecture. There have been attempts to formalize matrix sparsity more precisely. For example, a matrix of order *n* may be said to be sparse if it has *O(n)* nonzeros. But here we choose not to employ a formal definition. Instead, we say that *A* is **sparse** if it is advantageous to exploit its zero entries. Otherwise, *A* is **dense**.

The **sparsity pattern** S{*A*} of *A* is the set of nonzeros, that is,

$$\mathcal{S}\{A\} = \{(i,j) \mid a_{ij} \neq 0, \ 1 \le i, j \le n\}.$$

The number of nonzeros in *A* is denoted by *nz(A)* (or |S{*A*}|). *A* is **structurally (or symbolically) singular** if there are no values of the *nz(A)* entries of *A* whose row and column indices belong to S{*A*} for which *A* is nonsingular. S{*A*} is symmetric if for all *i* and *j*, $a_{ij} \neq 0$ if and only if $a_{ji} \neq 0$ (the values of the two entries need not be the same). If S{*A*} is symmetric, then *A* is structurally symmetric.

In some situations, sparse vectors (vectors that contain many zero entries) are considered. The sparsity pattern of a vector *v* of length *n* is given by

$$\mathcal{S}\{v\} = \{i \mid v_i \neq 0, \ 1 \le i \le n\},$$

and |S{*v*}| denotes the number of nonzeros in *v*. Note that here and elsewhere curly brackets {*.*} are used when working with sets to help distinguish sets from vectors.

We say that the matrix *A* is **factorizable** (or **strongly regular**) if its principal leading minors (the determinants of its principal leading submatrices) are nonzero, that is, if its LU factorization without row/column interchanges does not break down. For example, SPD matrices are factorizable. For more general *A*, in exact arithmetic, the following standard result holds.

#### **Theorem 1.1 (Golub & Van Loan 1996)**

*If A is nonsingular, then the rows of A can be permuted so that the permuted matrix is factorizable.*

The row permutations do not need to be known in advance of the factorization; rather they can be constructed as the factorization proceeds.
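A small example may help to make Theorem 1.1 concrete. The sketch below is illustrative only: it tests factorizability by evaluating the leading principal minors of a small dense matrix with `numpy.linalg.det` and a tolerance, which is a crude check suitable for tiny examples and not how a solver would proceed. The function name `is_factorizable` and the matrix are hypothetical.

```python
import numpy as np

def is_factorizable(A, tol=1e-12):
    """True if every leading principal minor of A exceeds tol in magnitude."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    return all(abs(np.linalg.det(A[:k, :k])) > tol for k in range(1, n + 1))

# A is nonsingular (its determinant is -2) but a_11 = 0, so the first
# leading principal minor vanishes and LU without interchanges breaks down.
A = np.array([[0.0, 1.0, 0.0],
              [2.0, 1.0, 0.0],
              [0.0, 3.0, 1.0]])

# Swapping the first two rows gives a matrix whose leading minors are 2, 2, 2.
perm = [1, 0, 2]
PA = A[perm, :]

print(is_factorizable(A))    # False
print(is_factorizable(PA))   # True
```

Here the permutation was chosen by inspection. In a factorization with row interchanges, the same swap would be discovered at the first elimination step, when the zero pivot is encountered.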

#### *1.2.1 Phases of a Sparse Direct Solver*

A direct method for solving the sparse system (1.1) comprises a number of distinct phases. The matrix *A* is factorized, and then, given the right-hand side *b*, the factors are used to compute the solution *x*. There is no single direct method that performs best on all problems and all computer architectures. Instead, many different algorithms have been proposed and implemented, some focussing on special classes of problems and/or particular architectures. However, in general, most approaches split the factorization into a **symbolic phase** (also called the **analyse phase**) and a **numerical factorization phase** that computes the factors. The symbolic phase typically uses only the sparsity pattern S{*A*} to compute the nonzero structure of the factors of *A* without computing the numerical values of the nonzeros. Following the numerical factorization, the **solve phase** uses the factors to solve for a single *b* or for multiple right-hand sides or for a sequence of right-hand sides one-by-one.

The fill-in in the matrix factors can render a direct method infeasible. Thus the symbolic phase typically incorporates finding a permutation (ordering) of the rows and columns of *A* to limit fill-in. There are many different ways to look for fill-reducing orderings; this is discussed in Chapter 8. Once the permutation has been selected, the symbolic phase determines the sparsity pattern of the factors of the permuted matrix and other key properties such as the number of entries in each row and column of the factors. This is achieved using the close relationships between matrices and graphs, which we review in Chapter 2. A symbolic factorization can also be used in algorithms that construct approximate factorizations by dropping nonzeros from *A* and factoring the resulting sparser matrix. These approximate factors can be employed as preconditioners for an iterative method.

Historically, the symbolic phase was much faster than the factorization phase, but considerable effort has gone into parallelizing the factorization so that the gap between the times for the two phases has narrowed. Indeed, the ordering part of the symbolic phase can dominate the total solution time. To prevent the symbolic phase from becoming a computational bottleneck, it needs to use efficient implementations of sophisticated algorithms. By setting up the data structures needed for computing and holding the factors, the symbolic factorization contributes to the efficiency of the subsequent numerical factorization in terms of time and memory. In many applications (for instance, when solving nonlinear equations), it is necessary to solve a series of problems in which the numerical values of the entries of *A* change but S{*A*} does not. In this case, the symbolic phase can generally be performed just once and its cost amortized across the numerical factorizations.
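To give a flavour of what a symbolic phase does, the following sketch predicts the sparsity pattern of a Cholesky factor from the pattern of *A* alone, without touching any numerical values. It is a deliberately naive scheme, written for clarity and not efficiency, and is not the algorithm developed later in the book. It assumes a symmetric pattern with a nonzero diagonal, ignores the possibility of accidental numerical cancellation, and uses 0-based indices. The function name `symbolic_cholesky` is hypothetical.

```python
def symbolic_cholesky(pattern, n):
    """Predict the column structures of the Cholesky factor L.

    pattern: set of (i, j) index pairs of a symmetric sparsity pattern.
    Returns a list whose j-th item is the set of row indices of column j of L.
    """
    # Row indices below the diagonal in each column of A.
    cols = [set() for _ in range(n)]
    for i, j in pattern:
        if i > j:
            cols[j].add(i)
        elif j > i:
            cols[i].add(j)          # by symmetry, (i, j) implies (j, i)

    # Eliminating column j couples all of its later rows pairwise; each new
    # coupling is a fill-in entry in one of the later columns.
    for j in range(n):
        rows = sorted(cols[j])
        for pos, i in enumerate(rows):
            cols[i].update(rows[pos + 1:])

    return [{j} | cols[j] for j in range(n)]

n = 5
diag = {(i, i) for i in range(n)}
# Arrow pattern with the full row/column ordered first, and ordered last.
arrow_first = diag | {(i, 0) for i in range(1, n)} | {(0, i) for i in range(1, n)}
arrow_last = diag | {(n - 1, i) for i in range(n - 1)} | {(i, n - 1) for i in range(n - 1)}

nz_first = sum(len(c) for c in symbolic_cholesky(arrow_first, n))
nz_last = sum(len(c) for c in symbolic_cholesky(arrow_last, n))
print(nz_first, nz_last)   # 15 9: complete fill-in versus none
```

Because only the pattern is used, the result can be reused unchanged for every matrix in a sequence that shares the same sparsity pattern, which is why the cost of the symbolic phase can be amortized.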

#### *1.2.2 Comments on the Computational Environment*

The von Neumann architecture—the fundamental architecture upon which nearly all digital computers have been based—involves the union of a central processing unit (CPU) and the memory, interconnected via input/output (I/O) mechanisms, as depicted in Figure 1.3. Despite being extremely simple, this sequential model remains useful, although nowadays the role of the CPU is undertaken by a mixture of powerful processors, co-processors, cores, GPUs, and so on, and current computer architectures employ complex memory hierarchies. Performing arithmetic operations on the processing units is much faster than communication-based operations. Moreover, improvements in the speed of the processing units outpace those in the memory-based hardware. Moore's law is an example of an experimentally derived observation of this kind.

**Figure 1.3** A simple uniprocessor von Neumann computer model.

Two important milestones in processor development have been **multiple functional units** that compute identical numerical operations in parallel and **data pipelining** (also called **vectorization**) that enables the efficient processing of vectors and matrices. Vectorization is often supported by additional hardware and software tools (for instance, **instruction pipelining**), by memory components such as **registers**, and by memory architectures with multiple layers, including small but fast memories called **caches**. Superscalar processors that enable the **overlapping** of identical (or different) arithmetic operations during runtime have been a standard component of computers since the 1990s. The ever-increasing heterogeneity of processing units and their hardware environment inside computers has led to significant effort being invested to support code implementations, for example, by expressing the code via units of scheduling and execution called **threads**.

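A key objective of many numerical linear algebra algorithms is reducing time to solution. This is usually bound by one of the following: the rate at which the processing units can perform arithmetic operations (compute throughput), the rate at which data can be moved to and from memory (memory throughput), or the delay between requesting data and receiving them (latency).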


Depending on which of these is the constraining factor, a given algorithm is said to be compute-bound, memory-bound, or latency-bound. Latency can often be hidden by performing non-dependent operations arising from a different part of a vector or matrix while waiting for a result, and as such is most typically a constraining factor for small problems or, more rarely, in the execution of complex algorithms on less powerful processors where resource limitation (for example, the number of registers) prevents such approaches.

On modern machines, the memory throughput is normally much lower than that required to keep all functional units busy without significant reuse of operands, and this is generally true at all levels of cache. It can be useful to consider an algorithm's compute intensity, that is, the ratio of the number of operations to the number of operands read from memory. Most chips are designed such that dense matrix–matrix multiply, which typically performs *n*<sup>3</sup> operations on *n*<sup>2</sup> data (with ratio *k* for a blocked algorithm with block size *k*), can run at full compute throughput, while matrix–vector multiply performs *n*<sup>2</sup> operations on *n*<sup>2</sup> data (ratio 1) and is limited by the memory throughput. The development of basic linear algebra subroutines (**BLAS**) for performing common linear algebra operations on dense matrices was partially motivated by obtaining a high ratio. In the late 1980s, matrix–matrix operations (implemented by Level 3 BLAS) became a must once computers were able to store matrix blocks with accompanying processor instructions inside registers and fast caches. Matrix–matrix operations are able to take advantage of the fact that data that are reused within a small amount of time or are stored in close memory locations (temporal and spatial locality) are processed efficiently. Consequently, employing Level 3 BLAS when designing and implementing matrix algorithms (for both sparse and dense matrices) can improve performance compared to using Level 1 and Level 2 BLAS.

There are other important motivations behind using the BLAS. In particular, they facilitate software development by providing standardized codes for performing common vector and matrix operations that are robust, efficient, and portable. Machine-specific optimized BLAS libraries are available for a wide variety of computer architectures, and because of the importance and widespread use of the BLAS, new implementations are provided by computer vendors as architectures change.

In this book, we discuss the design of algorithms that aim to achieve computational efficiency through exploiting data locality and using established matrix block and vector operations as fundamental building blocks. We assume an idealized computer model, not a specific architecture or language.

#### *1.2.3 Finite Precision Arithmetic*

When designing numerical algorithms, it is important to consider how the numerical operations are performed and the effects of computational errors. Finite precision arithmetic underlies all computations that are performed numerically. Historically, computer arithmetic varied greatly between different computer manufacturers, and this was a source of many problems when attempting to write software that could be easily ported between computers. Variations were reduced significantly in 1985 with the development of the Institute of Electrical and Electronics Engineers (IEEE) standard for computer floating-point arithmetic. The IEEE standard is now widely used, and the majority of contemporary computers represent real numbers using binary floating-point arithmetic that expresses real numbers as

$$a = \pm d_1.d_2 \dots d_t \times 2^k,$$

where *k* is an integer and *d<sub>i</sub>* ∈ {0, 1}, 1 ≤ *i* ≤ *t*, with *d*<sub>1</sub> = 1 unless *d*<sub>2</sub> = *d*<sub>3</sub> = ... = *d<sub>t</sub>* = 0. The number of digits *t* is 24 in single precision and 53 in double precision. The exponent *k* lies in the range −126 ≤ *k* ≤ 127 in single precision and −1022 ≤ *k* ≤ 1023 in double precision. Floating-point operations can be written as

$$\operatorname{fl}(a \;\mathrm{op}\; b) = (a \;\mathrm{op}\; b)(1+\delta), \qquad |\delta| \le \epsilon,$$

where *op* is a mathematical operation (such as =, +, −, ×, /, √), (*a op b*) is the exact result of the operation, and *ε* is the **machine precision** (or unit roundoff). 2*ε* is the smallest floating-point number that when added to the floating-point number 1.0 produces a result that is different from 1.0. For IEEE single precision arithmetic, *ε* = 2<sup>−24</sup> ≈ 10<sup>−7</sup> and for double precision *ε* = 2<sup>−53</sup> ≈ 10<sup>−16</sup>. Any operation on floating-point numbers should be thought of as introducing a relative error of absolute value at most *ε*. When the results of such operations are fed into other operations to form an algorithm, these errors propagate through the calculations. The two main sources of computational errors that are consequences of floating-point arithmetic are rounding errors and truncation errors. Certain operations can amplify the errors and lead to catastrophic failure when algorithms that are exact in conventional arithmetic are executed in floating-point arithmetic. Such algorithms are said to be **numerically unstable**; for sparse linear systems, this is discussed in Chapter 7.
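The double precision values quoted above can be checked directly. The following is a minimal sketch, assuming that Python floats are IEEE double precision numbers with round-to-nearest (true on essentially all current platforms, but an assumption here); the variable names are illustrative only.

```python
import sys
from fractions import Fraction

# Unit roundoff for IEEE double precision, as quoted in the text.
eps = 2.0 ** -53

# sys.float_info.epsilon is the gap between 1.0 and the next larger double,
# which is 2 * eps in the notation of the text.
gap_matches = (sys.float_info.epsilon == 2.0 * eps)

one_plus_2eps = 1.0 + 2.0 * eps   # representable, so it differs from 1.0
one_plus_eps = 1.0 + eps          # exact tie, rounded back to 1.0

print(gap_matches)                # True
print(one_plus_2eps != 1.0)       # True
print(one_plus_eps == 1.0)        # True

# One multiplication: fl(a * b) = (a * b)(1 + delta) with |delta| <= eps.
# Fractions give the exact product of the two stored doubles.
a, b = 0.1, 0.3
exact = Fraction(a) * Fraction(b)
computed = Fraction(a * b)
delta = abs(computed - exact) / exact
print(delta <= Fraction(eps))     # True
```

The bound on `delta` holds for a single operation; it says nothing about how errors accumulate over a whole algorithm.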

#### *1.2.4 Bit Compatibility*

For sequential solvers, achieving bit compatibility (in the sense that two runs on the same machine using the same binary and identical input data should produce identical output) is not a problem. But enforcing bit compatibility can limit dynamic parallelism, and when designing parallel sparse solvers, the objective of efficiency potentially conflicts with that of bit compatibility. Bit compatibility is essential for some users because of regulatory requirements (for example, within the nuclear or financial industries) or to build trust in their software from nontechnical users (who may find the non-reproducibility of results worrying or unacceptable). For others, it is just a desirable feature for debugging purposes. Often linear solves occur at the core of much more complicated codes that typically feature heuristics that can be sensitive to very small changes in the linear solutions found.

The critical issue is the way in which *N* numbers (or, more generally, matrices) are assembled, that is,

$$\mathit{sum} = \sum_{j=1}^{N} C_j,$$

where the *Cj* are computed using one or more processors. The assembly is commutative but, because of the potential rounding of the intermediate results, is not associative so that the result *sum* depends on the order in which the *Cj* are assembled. A straightforward approach to achieving bit compatibility is to enforce a defined order on each assembly operation, independent of the number of processors, but this may adversely limit the scope for parallelism.
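The lack of associativity is easy to observe. The sketch below sums three double precision numbers in two orders and then shows one way of enforcing a defined assembly order. The function `assemble_in_fixed_order` and the idea of tagging each contribution with an index are illustrative assumptions, not a description of any particular solver.

```python
# Three contributions C_1, C_2, C_3 (Python floats, assumed IEEE doubles).
C = [0.1, 0.2, 0.3]

left_first = (C[0] + C[1]) + C[2]
right_first = C[0] + (C[1] + C[2])
print(left_first == right_first)   # False: the rounding differs


def assemble_in_fixed_order(contributions):
    """Sum (index, value) pairs in increasing index order.

    The result does not depend on the order in which the pairs arrive,
    at the price of a sort (or, in a parallel code, of waiting).
    """
    total = 0.0
    for _, value in sorted(contributions):
        total += value
    return total


# The same contributions arriving in two different orders.
arrival_a = [(1, 0.1), (2, 0.2), (3, 0.3)]
arrival_b = [(3, 0.3), (1, 0.1), (2, 0.2)]
same = assemble_in_fixed_order(arrival_a) == assemble_in_fixed_order(arrival_b)
print(same)                        # True
```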

#### *1.2.5 Complexity of Algorithms*

The computational complexity of a numerical algorithm is typically based on estimating asymptotically the number of integer or floating-point operations or the memory usage. Computational complexity is expressed as a function of the algorithm's input parameters (typically the problem size) and is concerned with how fast that function grows. Only the highest order terms are considered: scalar factors and lower order terms are ignored. For simplicity, consider a single input parameter. A real function *y*(*d*) of a nonnegative real *d* satisfies *y* = *O*(*g*) if there exist positive constants *c* and *d*<sub>0</sub> such that

$$|y(d)| \le c\,g(d) \text{ for all } d \ge d_0.$$

*O(g)* bounds *y* asymptotically from above. As a simple illustration, consider the quadratic function in *d*

$$
y(d) = \alpha d^2 + \beta d - \gamma, \qquad \alpha \neq 0.
$$

In this case, *y*(*d*) = *O*(*d*<sup>2</sup>), and the coefficient of the highest asymptotic term is *α*. In some cases, a function can also be asymptotically bounded from below. However, we will only use the *O*(.) notation because it is more important for sparse matrix algorithms to specify upper bounds than to discuss special cases that may imply lower bounds.

Computational complexity can estimate quantities related to the worst-case behaviour of an algorithm or its average behaviour. When considering complexity based on operation counts, as a result of using a unit-cost random-access computer model, it is common to assume the operations have a unit cost. But in practice there can be a significant difference between the cost of operations, such as addition and subtraction, and operations with integer operands or operations using different precisions. Division and square root operations can be significantly more expensive than multiply/add operations; the difference is highly dependent on the computing platform. Thus, unit cost can be a significant simplification, and counting floating-point operations is arguably of limited value in assessing the performance of different algorithms on modern computers. Nevertheless, sparse matrix algorithms that are *O*(*n*<sup>3</sup>) are considered to be computationally too expensive: the goal when designing algorithms is that they should be linear (or close to linear) in the size of the input, that is, linear in *n* or *nz*(*A*). Linear complexity is often achieved in the symbolic phase of a sparse direct solver, but the complexity of the numerical factorization phase is typically higher and may determine the size of the linear systems that can be solved using a sparse direct method. However, for modern computer architectures, the number of floating-point operations is not necessarily a good indicator of the time required to solve the linear system. Indeed, parallel implementations of algorithms that perform more operations than the minimum needed can lead to reductions in the runtime because costly data movements and synchronizations can be limited by, for example, duplicating operations on multiple processors.

As computers have become more powerful (in terms of both the computational speed and the available memory), the size of the linear systems that can be solved using a (parallel) dense method that ignores sparsity in *A* has steadily increased; nowadays linear systems with *n* of the order 10<sup>5</sup> can potentially be tackled using a dense solver (although if *A* is sparse, the operation count and solution time will generally be greatly reduced by using algorithms that limit operations on zeros). Many practical applications lead to systems where *A* is sparse and *n* is significantly larger than this. The size of systems that can be solved using a sparse direct method has also steadily increased over the years, and the algorithms they use have become ever more sophisticated so that it is commonplace to solve systems of order greater than 10<sup>7</sup>. But the complexity does limit the problem size, and for very large systems, an iterative solver is often the only option.

In computer science, complexity theory introduces additional concepts and distinguishes between problems for which algorithms of polynomial complexity exist and those where a hypothesis is that only algorithms of super polynomial complexity exist. Without going into detail, we refer to problems in this latter class as being **combinatorially hard**.

### **1.3 Sparse Matrices and Their Representation in a Computer**

To implement sparse matrix algorithms on a computer requires special **data structures** and **storage schemes** that allow matrices and vectors to be stored, retrieved, manipulated, and updated. There are many ways to do this; key to them all is that they must be compact and avoid storing and manipulating numerically zero entries.

#### *1.3.1 Sparse Vector Storage*

A sparse vector can be stored using a real array for the nonzero values together with an integer array containing the indices of these entries, as demonstrated by the following example.

*Example 1.1* Let *v* be the sparse row vector

$$v = \begin{pmatrix} 1. & -2. & 0. & -3. & 0. & 5. & 3. & 0. \end{pmatrix}. \tag{1.3}$$

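The real array valV that stores the nonzero values and the corresponding integer array indV of their indices are each of length |S{*v*}| = 5 and, with the entries listed in order of increasing index, are as follows:

| Subscripts | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| valV | 1. | −2. | −3. | 5. | 3. |
| indV | 1 | 2 | 4 | 6 | 7 |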
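In code, the compressed storage of the vector (1.3) can be sketched as follows. The helper names `compress` and `expand` are hypothetical, and the indices are kept 1-based to match the text (Python lists themselves are 0-based).

```python
def compress(dense):
    """Return (valV, indV): the nonzero values and their 1-based indices."""
    valV, indV = [], []
    for position, value in enumerate(dense, start=1):
        if value != 0.0:
            valV.append(value)
            indV.append(position)
    return valV, indV


def expand(valV, indV, n):
    """Rebuild the dense vector of length n from the compressed arrays."""
    dense = [0.0] * n
    for value, position in zip(valV, indV):
        dense[position - 1] = value   # shift to Python's 0-based indexing
    return dense


v = [1.0, -2.0, 0.0, -3.0, 0.0, 5.0, 3.0, 0.0]   # the vector (1.3)
valV, indV = compress(v)
print(valV)   # [1.0, -2.0, -3.0, 5.0, 3.0]
print(indV)   # [1, 2, 4, 6, 7]
```

Nothing in the format requires the entries to be held in index order; `compress` produces that order only because it scans the dense vector from left to right.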


Alternatively, a **linked list** can be used. While modern programming languages often support linked lists directly as an abstract data structure, in sparse matrix algorithms it is usual to implement them explicitly using arrays together with an integer that points to the first entry (the header pointer). Each entry is associated with a link that points to the next entry or is null if the entry is the last in the list. The links can be adjusted so that the values are scanned in a different order without moving the physical locations. Storing the vector (1.3) as a linked list is illustrated in Example 1.2. Here *v* is stored in two different ways, emphasizing that the order of the entries is determined by the links, not by the physical locations of the entries.

*Example 1.2* Two possible ways of storing the sparse vector (1.3) using linked lists.


There are two important reasons for using linked lists. Firstly, it is straightforward to add extra entries, and secondly, entries can be removed without any data movement. This is illustrated in Example 1.3. Linked lists are an example of a **dynamic** structure.
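The array-based linked list described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the layout used in the book's examples: the class name `SparseVectorList`, the use of location 0 as the null link, and the choice to keep the list in increasing index order are all assumptions.

```python
class SparseVectorList:
    """Sparse vector held in arrays ind, val, link (1-based; 0 is the null link)."""

    def __init__(self, entries):
        # Location 0 of each array is unused so that 0 can act as "null".
        self.ind, self.val, self.link = [0], [0.0], [0]
        self.header = 0                      # header pointer: first entry in the list
        for index, value in entries:
            self.add(index, value)

    def add(self, index, value):
        """Store the entry in the next free location and splice it into index order."""
        self.ind.append(index)
        self.val.append(value)
        self.link.append(0)
        new = len(self.ind) - 1
        prev, cur = 0, self.header
        while cur != 0 and self.ind[cur] < index:
            prev, cur = cur, self.link[cur]
        self.link[new] = cur
        if prev == 0:
            self.header = new
        else:
            self.link[prev] = new

    def remove(self, index):
        """Unlink the entry; its storage stays in place but is no longer accessed."""
        prev, cur = 0, self.header
        while cur != 0 and self.ind[cur] != index:
            prev, cur = cur, self.link[cur]
        if cur == 0:
            raise KeyError(index)
        if prev == 0:
            self.header = self.link[cur]
        else:
            self.link[prev] = self.link[cur]

    def items(self):
        """Scan the entries in the order given by the links."""
        cur = self.header
        while cur != 0:
            yield self.ind[cur], self.val[cur]
            cur = self.link[cur]


# The vector (1.3), deliberately entered in a scrambled physical order.
v = SparseVectorList([(6, 5.0), (1, 1.0), (7, 3.0), (2, -2.0), (4, -3.0)])
v.add(5, -4.0)      # add an entry -4 in position 5
v.remove(2)         # remove the entry -2 in position 2
result = list(v.items())
print(result)       # [(1, 1.0), (4, -3.0), (5, -4.0), (6, 5.0), (7, 3.0)]
```

Both operations change only links (and append one new location); no existing entry is moved, which is the point made in the text.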

*Example 1.3* On the left, an entry −4 has been added to the sparse vector (1.3) in position 5, and, on the right, the entry −2 in position 2 has been removed. ∗ indicates the entry is not accessed. The links that have changed are in bold.


#### *1.3.2 Sparse Matrix Storage*

The vector data structures can be generalized to sparse matrices. The simplest way to store a sparse matrix is using **coordinate** (or **triplet**) format. The individual entries of *A* are held as triplets (*i*, *j*, *a<sub>ij</sub>*), where *i* is the row index and *j* is the column index of the entry *a<sub>ij</sub>* ≠ 0. Three arrays (one real and two integer) each of length *nz*(*A*) are needed. Although this form is easy to create, it is not efficient for manipulating sparse matrices (for example, just adding two sparse matrices with different sparsity structures presents difficulties).

The **CSR (Compressed Sparse Row)** format is widely used. The column indices of the entries of *A* are held by rows in an integer array (which we will call colindA) of length *nz(A)*, with those in row 1 followed by those in row 2, and so on (with no space between rows). Often, within each row, the entries are held by increasing column index. A real array valA of the same length holds the values of the corresponding entries of *A* in the same order. A third array rowptrA of length *n*+1 is such that its *i*-th entry points to the position of the start of row *i* (1 ≤ *i* ≤ *n*) of *A* within colindA and valA, and rowptrA*(n* + 1*)* is set to *nz(A)* + 1.

**CSC (Compressed Sparse Columns)** format is defined analogously by holding the entries by columns, rather than by rows. If *A* is symmetric, only the lower (or upper) triangular part is generally stored. If the matrix values are not stored, the arrays rowptrA and colindA represent the graph G(*A*), which we discuss in the next chapter.
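The conversion from coordinate format to CSR, and a matrix–vector product that uses the CSR arrays, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the 3 × 3 matrix is made up for the example (it is not the matrix of Example 1.4), and indices are 0-based as is usual in Python, so the final pointer is *nz*(*A*) rather than *nz*(*A*) + 1.

```python
def coo_to_csr(n, rowind, colind, val):
    """Convert triplets to CSR arrays (rowptr, colindA, valA), 0-based."""
    nz = len(val)
    rowptr = [0] * (n + 1)
    for i in rowind:                  # count the entries in each row
        rowptr[i + 1] += 1
    for i in range(n):                # running sum gives the start of each row
        rowptr[i + 1] += rowptr[i]
    colindA = [0] * nz
    valA = [0.0] * nz
    next_free = rowptr[:-1].copy()    # next unused slot within each row
    for i, j, a in zip(rowind, colind, val):
        p = next_free[i]
        colindA[p] = j
        valA[p] = a
        next_free[i] += 1
    return rowptr, colindA, valA


def csr_matvec(rowptr, colindA, valA, x):
    """Compute y = A x, touching only the stored entries."""
    n = len(rowptr) - 1
    y = [0.0] * n
    for i in range(n):
        for p in range(rowptr[i], rowptr[i + 1]):
            y[i] += valA[p] * x[colindA[p]]
    return y


# Hypothetical matrix [[4, 0, 1], [0, 2, 0], [3, 0, 5]], triplets in no particular order.
rowind = [2, 0, 1, 2, 0]
colind = [0, 0, 1, 2, 2]
val = [3.0, 4.0, 2.0, 5.0, 1.0]

rowptr, colindA, valA = coo_to_csr(3, rowind, colind, val)
y = csr_matvec(rowptr, colindA, valA, [1.0, 1.0, 1.0])
print(rowptr)   # [0, 2, 3, 5]
print(y)        # [5.0, 2.0, 8.0]
```

The conversion runs in time proportional to *n* + *nz*(*A*). It keeps the entries of a row in the order in which the triplets were supplied, so an extra sort within each row is needed if increasing column order is required.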

*Example 1.4* Let *A* be the sparse matrix

$$A = \begin{array}{ccccc} & 1 & 2 & 3 & 4 & 5 \\ & 2 & & & & \\ & 2 & & & & \\ A = & 3 & & & & \\ & & 4 & & & \\ & & & 1. & \\ & & & 7. & & 6. \\ \end{array} \tag{1.4}$$


Coordinate format represents *A* as follows. Note that the entries are in no particular order.


CSR format represents *A* as follows. Here the entries within each row are in order of increasing column index. This additional condition is often but not always used.


The CSR and CSC formats are **static** data structures. While reading *A* is straightforward, it can be difficult to make modifications, for instance, adding a new entry at a specified location. Removing an entry is also problematic. The value of the entry could be set to zero, but if a significant number of entries are set to zero, this may not be efficient because, when *A* is used, operations are performed on zeros and more memory than is necessary is used. Adding and deleting entries are possible if the sparse rows or columns are stored using linked lists.

*Example 1.5* The matrix in (1.4) can be held as a collection of columns, each in a linked list, as follows. Here the array colA\_head holds header pointers, with the *i*-th entry pointing to the location of the first entry in column *i*.


For column 4, colA\_head(4) = 5, rowindA(5) = 1, and valA(5) = −2, so the first entry in column 4 is *a*<sub>14</sub> = −2. Next, link(5) = 4, rowindA(4) = 4, and valA(4) = 1, so the second entry in column 4 is *a*<sub>44</sub> = 1. Because link(4) = 0, there are no more entries in the column. If we want to add an entry to the (3, 4) position while retaining the order of the entries within column 4, then we do this by setting valA(11) to hold the new entry, and rowindA(11) = 3, link(5) = 11, and link(11) = 4 (the original value of link(5)). The resulting link array is shown below, with the entries that have changed given in bold.


A disadvantage of linked list storage is that it prohibits the fast access to rows (or columns) of the matrix that is needed for efficient processing on contemporary computers that use vectorization and/or work with matrix blocks. Consequently, CSR or CSC formats are commonly used in sparse direct methods.

Static data structures are efficient for sparse matrix factorizations if the sparsity structures of the factors are known before the factorization begins. However, it is often the case that new nonzero entries need to be added and/or others need to be removed, and it is not necessarily possible to predict the required space in advance. A storage scheme that has some space to embed new nonzeros is the **DS (Dynamic Sparse)** format. It stores the nonzeros of both the rows and columns of *A* in real arrays valAR and valAC, with the corresponding row and column indices held in integer arrays rowindA and colindA. Pointers to the start of each row and column are stored in the integer arrays rowptrA and colptrA, as in the CSR and CSC formats. In addition, the lengths of the compressed rows and columns (which are called row and column segments) are stored separately. In some situations, it can be sufficient to hold only the row (or the column) information (DSR and DSC formats). The following example illustrates the DS format.

*Example 1.6* Consider again the matrix given by (1.4). The DS format represents *A* using two sets of arrays. The first four store the matrix by rows, and the second four store it by columns. The entries are in no particular order in both sets of arrays. The arrays rlength and clength hold the numbers of entries in the rows and columns, respectively. Free space between segments can be used to store new nonzero entries, and it is this that makes the storage scheme efficient, provided the number of changes to the matrix structure during the factorization is limited.


Blocked formats may be used to accelerate multiplication between a sparse matrix and a dense vector. Iterative methods typically require that the same sparse matrix is multiplied by vectors many times before a solution is found. The matrix can be put into a block storage format once, and then the cost of finding the blocks and converting the matrix format can be offset by the savings that result from repeatedly multiplying the matrix. The **Variable Block Row (VBR)** format groups together similar adjacent rows and columns. The numbers of such rows and columns can be different in each dimension, resulting in variable sized blocks. For a large sparse block-structured matrix, using a VBR format potentially reduces the amount of integer storage, and the block representation enables numerical algorithms to perform the kernel matrix operations more efficiently on the block entries. However, only heuristic algorithms are available for determining the groupings of the rows and columns.

The data structure of the VBR format uses six arrays. Integer arrays rptr and cptr hold the index of the first row in each block row and the index of the first column in each block column, respectively. In many cases, the block row and column partitionings are conformal, and only one of these arrays is needed. The real array valA contains the entries of the matrix block-by-block in column-major order. The integer array indx holds pointers to the beginning of each block entry within valA. The index array bindx holds the block column indices of the block entries of the matrix, and finally, the integer array bptr holds pointers to the start of each row block in bindx.

*Example 1.7* Let *A* be the sparse matrix

$$A = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8\\ 1 & 2 & & & & 3. & & &\\ 2 & 4 & 5. & & & 6. & &\\ 3 & & 7. & 8. & 9. & 10. &\\ & & 7. & 8. & 9. & 10. &\\ 11. 12. & & & & 15. & 16. &\\ & 6 & & 13. & & & 17. &\\ & & 14. & & & & 18. &\\ & & & 19. & 20. &\\ & & & 21. & 22. & & \end{pmatrix}$$

Here the row blocks comprise rows 1:2, 3, 4:6, and 7:8. The column blocks comprise columns 1:2, 3:5, 6, 7:8. The VBR format stores *A* as follows.


### **1.4 Notes and References**

There are some excellent textbooks that provide in-depth coverage of numerical linear algebra for dense matrices (such as Golub & Van Loan, 1996; Demmel, 1997; Trefethen & Bau, 1997, and Strang, 2007). Although sparse direct methods have been a constant subject for research since the 1960s and despite their importance and widespread use, there has only ever been a handful of books focusing on them. The most recent are Davis (2006) and Duff et al. (2017), but see also Tewarson (1973), George & Liu (1981), Pissanetzky (1984), and Zlatev (1991). In addition, Meurant (1999) covers both direct and iterative methods. The books by Björck (1996, 2015) and Wendland (2017) are also relevant.

We focus on factorizations based on Gaussian elimination, but another important class of direct methods are those based on orthogonal factorizations, most notably QR factorizations of the form *A* = *QR*, where *Q* is an orthogonal matrix and *R* is an upper triangular matrix. These methods are generally more expensive than those that use LU factorizations (in terms of operation counts, the density of the factors, and the time required to solve the linear system), but they can offer advantages in terms of numerical stability. We refer the reader to the book by Davis (2006) for a study of such approaches.

Over the last fifty years, in addition to the huge quantity of journal articles relating to specific aspects of sparse direct methods, a number of useful survey and overview papers have been published. These not only summarize important aspects of sparse direct methods but provide interesting historical perspectives on the theoretical, algorithmic, and software developments in the field. Early surveys include Tewarson (1970), Reid (1974), Duff (1977, 1981), while the comprehensive survey of Demmel et al. (1993) sums up early developments in parallel sparse direct solvers. Gould et al. (2007) look specifically at software that implements sparse direct methods, while the excellent survey of Davis et al. (2016) includes many further references to review papers and early conference proceedings where some of the key ideas related to sparse direct methods were first introduced. A short overview of modern sparse elimination methods is given by Bollhöfer et al. (2020).

A wide range of books devoted to iterative methods for solving large-scale linear systems have been written, for example, Axelsson (1994), Greenbaum (1997), Saad (2003b), van der Vorst (2003), Olshanskii & Tyrtyshnikov (2014), Meurant & Duintjer Tebbens (2020), Bai & Pan (2021), and Ciaramella & Gander (2022).

There are many references to contemporary computational environments. To understand the basic principles and connection of computations with basic linear algebra subroutines (BLAS), a good starting point is Dongarra et al. (1998), while contributions in van der Vorst & Van Dooren (2015) provide a general resource on parallel computation in numerical linear algebra. Specific features of finite precision arithmetic in this field are clearly and thoroughly explained in Higham (2002). For the complexity of algorithms as well as for much of the terminology related to the sparse data structures used in this book, we refer to Tarjan (1983); we also recommend Cormen et al. (2009) or Skiena (2020).

Texts providing details of the storage formats that are primarily for sparse direct methods include Pissanetzky (1984) and Østerby & Zlatev (1983) (the latter discusses, in particular, dynamic data structures; see also the technical report of Duff, 1980). Storage schemes used in connection with preconditioned iterative methods are considered in Saad (2003b). VBR and other sparse storage formats are described, for example, in the SPARSKIT library documentation of Saad (1994b). Buluç et al. (2011) provide a good review and evaluation of storage formats for sparse matrices and their impact on primitive operations.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 2 Sparse Matrices and Their Graphs**

*The choice of data structure is one of the most important steps in algorithm design and implementation. Sparse matrix algorithms are no exception. The representation of a sparse matrix not only determines the efficiency of the algorithm, but also influences the algorithm design process –Buluç et al. (2011).*

*Every sparse matrix problem is a graph problem and every graph problem is a sparse matrix problem –Gilbert et al. (2006).*

Many sparse matrix algorithms exploit the close relationship between matrices and graphs. We make no assumption regarding the reader's prior knowledge of graph theory. The purpose of this chapter is to summarize basic concepts from graph theory that will be exploited later and to establish the notation and terminology that will be used throughout.

### **2.1 Introduction to Graphs**

A graph G = *(*V*,* E*)* is a finite set V of **vertices** (or **nodes**), and a set E of **edges** defined as pairs of distinct vertices. When there is no distinction between the pairs of vertices *(u, v)* and *(v, u)*, the edges are represented by unordered pairs, and the graph is **undirected**. If, however, the pairs are ordered, the graph is a **directed graph**, or a **digraph**. Examples of simple graphs are given in Figures 2.1 and 2.2.

A labelling (or ordering) of a graph G = *(*V*,* E*)* with *n* vertices is a bijection of {1*,* 2*,...,n*} onto V. The integer *i* (1 ≤ *i* ≤ *n*) assigned to a vertex in V is called the **label** (or simply the **number**) of that vertex. Our standard choice of vertices will be V = {1*,...,n*} so that the vertices are directly identified by their labels.

G<sub>s</sub> = (V<sub>s</sub>, E<sub>s</sub>) is a **subgraph** of G = (V, E) if and only if V<sub>s</sub> ⊆ V and E<sub>s</sub> ⊆ E and (*u<sub>s</sub>*, *v<sub>s</sub>*) ∈ E<sub>s</sub> implies *u<sub>s</sub>*, *v<sub>s</sub>* ∈ V<sub>s</sub>. The subgraph is an **induced subgraph** if E<sub>s</sub> contains all the edges in E that have both endpoints in V<sub>s</sub>. Two graphs G = (V, E) and G<sub>s</sub> = (V<sub>s</sub>, E<sub>s</sub>) are **isomorphic** if there is a bijection *g* : V → V<sub>s</sub> that preserves adjacency, that is, (*u*, *v*) ∈ E if and only if (*g*(*u*), *g*(*v*)) ∈ E<sub>s</sub>.

**Figure 2.1** An example of an undirected graph.

**Figure 2.2** An example of a directed graph (digraph). The arrows indicate the direction of an edge. There is an edge (4 → 5) and an edge (5 → 4).

In an undirected graph, two vertices *u* and *v* in V are said to be **adjacent** (or **neighbours**) if *e* = (*u*, *v*) ∈ E; the edge *e* is **incident** to the vertex *u* and to the vertex *v*. We also use the notation (*u* ←→ *v*) for an edge (or (*u* <sup>G</sup>←→ *v*) to emphasize the edge belongs to the graph G). The **degree** *deg*<sub>G</sub>(*u*) of *u* ∈ V is the number of vertices in V that are adjacent to *u*, and the **adjacency set** *adj*<sub>G</sub>{*u*} is the set of these adjacent vertices (thus |*adj*<sub>G</sub>{*u*}| = *deg*<sub>G</sub>(*u*)). If V<sub>s</sub> is a subset of the vertices, then the adjacency set *adj*<sub>G</sub>{V<sub>s</sub>} is the set of vertices in V \ V<sub>s</sub> that are adjacent to at least one vertex in V<sub>s</sub>. A subgraph is a **clique** when every pair of its vertices is adjacent. In the example in Figure 2.1, *deg*<sub>G</sub>(2) = 4 and *adj*<sub>G</sub>{2} = {1, 3, 4, 6}. The induced subgraph with vertices V<sub>s</sub> = {2, 4, 6} is a clique.
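These definitions translate directly into set operations. The sketch below uses only the edges of Figure 2.1 that the text itself states (those incident to vertex 2, plus the edge between 4 and 6 implied by the clique); the remainder of the figure is not reproduced, so the graph here is a fragment, and the function names are hypothetical.

```python
from itertools import combinations


def adjacency_sets(n, edges):
    """Build adj{u} for vertices 1..n from a list of unordered edges."""
    adj = {u: set() for u in range(1, n + 1)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def adj_of_subset(adj, Vs):
    """adj{Vs}: vertices outside Vs adjacent to at least one vertex in Vs."""
    Vs = set(Vs)
    return set().union(*(adj[u] for u in Vs)) - Vs


def is_clique(adj, Vs):
    """True if every pair of vertices in Vs is adjacent."""
    return all(v in adj[u] for u, v in combinations(Vs, 2))


# Fragment of Figure 2.1: only the edges stated in the text.
edges = [(1, 2), (2, 3), (2, 4), (2, 6), (4, 6)]
adj = adjacency_sets(6, edges)

print(len(adj[2]), sorted(adj[2]))            # 4 [1, 2... see below]
print(is_clique(adj, {2, 4, 6}))              # True
print(sorted(adj_of_subset(adj, {2, 4, 6})))  # [1, 3]
```

The first line prints the degree of vertex 2 followed by its sorted adjacency set, that is, 4 and the list 1, 3, 4, 6.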

In a digraph, we use the notation (*u* → *v*) or (*u* <sup>G</sup>→ *v*) for a directed edge. There can be an edge (*u* → *v*) but no edge (*v* → *u*). The adjacency set of *u* can be split into two parts

$$adj^{+}_{\mathcal{G}}\{u\} = \{v \mid (u \to v) \in \mathcal{E}\} \quad \text{and} \quad adj^{-}_{\mathcal{G}}\{u\} = \{v \mid (v \to u) \in \mathcal{E}\}.$$

In the example given in Figure 2.2, *adj*<sup>+</sup><sub>G</sub>{2} = {3, 4} and *adj*<sup>−</sup><sub>G</sub>{2} = {1}.
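For a digraph, the two parts of the adjacency set can be built in a single pass over the edges. The sketch below uses only the edges of Figure 2.2 that are stated in the text and in the figure caption (vertex 2 has out-neighbours 3 and 4 and the single in-neighbour 1, and there are edges 4 → 5 and 5 → 4), so it is a fragment of that figure; the function name is hypothetical.

```python
def directed_adjacency(n, edges):
    """Return (adj_plus, adj_minus) for vertices 1..n from directed edges (u, v)."""
    adj_plus = {u: set() for u in range(1, n + 1)}    # heads of edges leaving u
    adj_minus = {u: set() for u in range(1, n + 1)}   # tails of edges entering u
    for u, v in edges:
        adj_plus[u].add(v)
        adj_minus[v].add(u)
    return adj_plus, adj_minus


# Fragment of Figure 2.2: only the edges stated in the text and caption.
edges = [(1, 2), (2, 3), (2, 4), (4, 5), (5, 4)]
adj_plus, adj_minus = directed_adjacency(5, edges)

print(sorted(adj_plus[2]))    # [3, 4]
print(sorted(adj_minus[2]))   # [1]
```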

### **2.2 Walks, Paths, Cycles, and DAGs**

A sequence of *k* edges in an undirected graph G

$$u_0 \longleftrightarrow u_1 \longleftrightarrow \dots \longleftrightarrow u_{k-1} \longleftrightarrow u_k$$

is called a **walk** of length *k*. If G is a digraph, then the sequence

$$u_0 \longrightarrow u_1 \longrightarrow \dots \longrightarrow u_{k-1} \longrightarrow u_k$$

is a **directed walk**. The vertices *u*<sub>0</sub> and *u<sub>k</sub>* are connected by the walk, and for *k* > 0, *u<sub>k</sub>* is said to be **reachable** from *u*<sub>0</sub>; the set of vertices that are reachable from *u*<sub>0</sub> is denoted by R*each*(*u*<sub>0</sub>). The walk is **closed** if *u*<sub>0</sub> = *u<sub>k</sub>*; a closed walk is called a **cycle**. Graphs that do not contain cycles are **acyclic**. A (directed) **trail** is a (directed) walk in which all the edges are distinct and a (directed) **path** is a (directed) trail in which all the vertices (and therefore also all the edges) are distinct. The **distance** between two vertices is the number of edges in the shortest path connecting them (this is also called the **length** of the path). In Figure 2.2, there is a path of length 4 from vertex 1 to vertex 7 but no path from vertex 7 to vertex 1.

In the undirected graph G = *(*V*,* E*)*, a path between a pair of its vertices with labels *i* and *j* is denoted by

$$i \overset{\mathcal{G}}{\Longleftrightarrow} j$$

or, if it is clear which graph the path is in, by

$$i \Longleftrightarrow j.$$

If all intermediate vertices on the path are less than min{*i, j*}, then the path is called a **fill-path** and is denoted by

$$i \underset{min}{\overset{\mathcal{G}}{\Longleftrightarrow}} j \quad \text{or} \quad i \underset{min}{\Longleftrightarrow} j.$$

If all intermediate vertices on the path belong to a subset V*s*, then the path is denoted by

$$i \underset{\mathcal{V}\_{s}}{\overset{\mathcal{G}}{\Longleftrightarrow}} j \quad \text{or} \quad i \underset{\mathcal{V}\_{s}}{\Longleftrightarrow} j.$$

If G is a digraph, the double-sided arrow symbols are replaced by one-sided ones ⇒ in the direction of the edges. For example,

$$i \overset{\mathcal{G}}{\Longrightarrow} j, \quad i \Longrightarrow j, \quad i \underset{min}{\Longrightarrow} j \quad \text{and} \quad i \underset{\mathcal{V}\_{s}}{\Longrightarrow} j.$$

**Figure 2.3** An example of a DAG with two different topological orderings (see Section 4.4).

**Figure 2.4** An example of an undirected graph to illustrate reachability. If V*<sup>s</sup>* = {4*,* 5}, then R*each(*2*,* V*s)* = {1*,* 3*,* 6} and R*each(*6*,* V*s)* = {2*,* 3*,* 7}.

A very important special case of a digraph is one with no cycles. A directed acyclic graph is called a **DAG**. In a DAG, if there is a path *u* ⇒ *v* of nonzero length, then *u* is called an **ancestor** of *v* and *v* is said to be a **descendant** of *u*. Figure 2.3 depicts a DAG with two different orderings. For the labelling of the vertices on the left, vertices 2, 3, 5, and 6 are descendants of vertex 1, but only vertices 5 and 6 are descendants of vertex 4. Note that if the direction of each edge in a DAG is reversed, the resulting graph is also a DAG.

The notion of a **reachable set** is useful for the study of Gaussian elimination. Given a graph and a subset V*<sup>s</sup>* of its vertices, if *u* and *v* are two distinct vertices that do not belong to V*s*, then *v* is reachable from *u* through V*<sup>s</sup>* if *u* and *v* are connected by a path that is either of length 1 or is composed entirely of vertices that belong to V*<sup>s</sup>* (except for the endpoints *u* and *v*). Given V*<sup>s</sup>* and *u /*∈ V*s*, the reachable set R*each(u,* V*s)* is the set of all vertices that are reachable from *u* through V*s*. Note that if V*<sup>s</sup>* is empty or *u* does not belong to *adj*G*(*V*s)*, then R*each(u,* V*s)* = *adj*G*(u)*. A simple example is given in Figure 2.4.
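A reachable set can be computed by a search that is only allowed to continue through vertices of V*s*. The following is a minimal sketch; the function name `reach` and the small graph are assumptions made for illustration. The graph was chosen so that the two reachable sets quoted in the caption of Figure 2.4 are reproduced, but it is not claimed to be the graph drawn in that figure.

```python
def reach(u, Vs, adj):
    """Vertices outside Vs that are reachable from u through Vs."""
    reached, seen, stack = set(), {u}, [u]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y in seen:
                continue
            seen.add(y)
            if y in Vs:
                stack.append(y)   # interior vertex: the path may continue
            else:
                reached.add(y)    # endpoint outside Vs: the path stops here
    return reached

# Hypothetical undirected graph given by adjacency lists.
adj = {1: [2], 2: [1, 4], 3: [4], 4: [2, 3, 5], 5: [4, 6], 6: [5, 7], 7: [6]}
Vs = {4, 5}

print(reach(2, Vs, adj))
print(reach(6, Vs, adj))
print(reach(4, set(), adj))  # with an empty Vs the adjacency set is returned
```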

#### **2.3 Trees, Components, and Connectivity**

An undirected graph is **connected** if every pair of vertices is connected by a path. A connected acyclic graph is called a **tree**, that is, a tree is an undirected graph in which any two vertices are connected by exactly one path. Every tree with at least two vertices has at least two vertices of degree 1. Such vertices are called **leaf** vertices. A graph is a **forest** if it consists of a disjoint union of trees. This is illustrated in Figure 2.5.

If G is connected, then a **spanning tree** of G is a subgraph of G that is a tree containing every vertex of G. In general, a graph may have several spanning trees, but a graph that is not connected does not contain a spanning tree.

The concept of connectivity can be extended to the general case. A digraph G = *(*V*,* E*)* is **strongly connected** if for every pair of vertices *u, v* ∈ V there is a path from *u* to *v* and a path from *v* to *u*.

An **equivalence relation** defined for a collection of pairs of members of a set is a relation that satisfies three simple properties: reflexivity, symmetry, and transitivity. A key property of an equivalence relation on a set is that it induces a partitioning of the set. Strong connectivity is an equivalence relation on V. It induces a partitioning V = V<sup>1</sup> ∪ *...* ∪ V*<sup>s</sup>* such that each V*<sup>i</sup>* (1 ≤ *i* ≤ *s*) is strongly connected and is maximal with this property: no additional vertices from G can be included in V*<sup>i</sup>* without breaking its strong connectivity. The V*<sup>i</sup>* are called **strongly connected components** (or sometimes just **strong components**) of G.

Any undirected tree T = *(*V*,* E*)* can be converted into a **directed rooted tree** T = *(*V*,* E *)* by specifying a **root** vertex *r*. Note that *r* can be chosen arbitrarily: any choice gives a directed rooted tree. An edge *(u, v)* ∈ E becomes a directed edge *(u* → *v)* ∈ E if there is a path from *u* to *r* such that the first edge of this path is from *u* to *v*. Given *r*, this directed path is unique. We illustrate this transformation in Figure 2.6. *v* is called the **parent** of *u* if the directed edge *(u* → *v)* ∈ E ; *u* is said to be a **child** of *v* (two or more child vertices are referred to as **children**). Two vertices in a rooted tree are **siblings** if they have the same parent. Leaf vertices have no children. A rooted tree is a special case of a DAG.

**Figure 2.5** An example of an undirected graph with 12 vertices that is a forest (it consists of two disjoint trees). Vertices 1, 2, 3, 6, 7, 8, and 11 are leaf vertices.

**Figure 2.6** An example of an undirected tree T (left) and the rooted tree T (right) obtained from T by choosing the root *r* = 4. The arrows indicate the direction of the edges.

#### **2.4 Adjacency Graphs**

Adjacency graphs provide a link between sparse matrices and graphs. If *A* is a sparse matrix of order *n*, then an **adjacency graph** G*(A)* = *(*V*(A),* E*(A))* (often written simply as G) with *n* vertices V*(A)* = {1*,...,n*} can be associated with it. If *A* is structurally symmetric, then G*(A)* is undirected and its edge set is

$$\mathcal{E}(A) = \left\{ (i, j) \mid a\_{ij} \neq 0, \ i \neq j \right\}.$$

A digraph can be associated with a nonsymmetric *A* by setting

$$\mathcal{E}(A) = \{ (i \to j) \mid a\_{ij} \neq 0, \ i \neq j \}.$$

Each diagonal nonzero *aii* corresponds to a loop or self-edge. They are generally omitted from G, and many algorithms that use G implicitly assume that the diagonal entries of *A* are present. Figure 2.7 depicts the sparsity patterns of two simple sparse matrices and their graphs. To capture not only the sparsity pattern of *A* but also the values of the entries, G can be transformed into a **weighted** graph using a mapping <sup>E</sup>*(A)* <sup>→</sup> <sup>R</sup> and/or <sup>V</sup>*(A)* <sup>→</sup> <sup>R</sup>.

A special case is the directed graph associated with a triangular matrix. If *L* is a lower triangular matrix and *U* is an upper triangular matrix, then the directed graphs G*(L)* and G*(U )* have edge sets

$$\mathcal{E}(L) = \{ (i \to j) \mid l\_{ij} \neq 0, \ i > j \} \text{ and } \mathcal{E}(U) = \{ (i \to j) \mid u\_{ij} \neq 0, \ i < j \}. \tag{2.1}$$

It is sometimes convenient to use G*(L<sup>T</sup>)*, in which the direction of the edges is reversed:

$$\mathcal{E}(L^T) = \{(j \to i) \mid l\_{ij} \neq 0, \ i > j\}. \tag{2.2}$$

**Figure 2.7** An example of a structurally symmetric sparse matrix and its undirected graph (left) and a nonsymmetric sparse matrix and its digraph (right). Arrows indicate the direction of the edges in the digraph.

It is straightforward to see that <sup>G</sup>*(L)*, <sup>G</sup>*(L<sup>T</sup> )*, and <sup>G</sup>*(U )* are DAGs; they are sometimes referred to as elimination DAGs.
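As an illustration of how the edge sets above are obtained from a sparsity pattern, the following sketch builds them from small dense arrays. The matrices and function names are hypothetical; a sparse code would read the pattern from a compressed storage format rather than scanning all *n*<sup>2</sup> positions.

```python
# Hypothetical matrices; only the positions of the nonzeros matter.
A_sym = [[4, 1, 1, 1],
         [1, 3, 0, 0],
         [1, 0, 2, 0],
         [1, 0, 0, 5]]
A_nonsym = [[4, 1, 0, 0],
            [0, 3, 0, 2],
            [5, 0, 6, 0],
            [0, 0, 7, 8]]
L = [[1, 0, 0],
     [2, 1, 0],
     [0, 3, 1]]

def digraph_edges(M):
    """Directed edges (i -> j) for the off-diagonal nonzeros (1-based labels)."""
    n = len(M)
    return {(i + 1, j + 1) for i in range(n) for j in range(n)
            if i != j and M[i][j] != 0}

def is_structurally_symmetric(M):
    edges = digraph_edges(M)
    return all((j, i) in edges for i, j in edges)

def undirected_edges(M):
    """Merge each pair (i -> j), (j -> i) into a single edge with i < j."""
    return {(min(i, j), max(i, j)) for i, j in digraph_edges(M)}

edges_L = digraph_edges(L)                 # edge set of G(L), as in (2.1)
edges_LT = {(j, i) for i, j in edges_L}    # edge set of G(L^T), as in (2.2)

print(undirected_edges(A_sym))
print(digraph_edges(A_nonsym))
print(edges_L, edges_LT)
```

Every edge of G*(L)* points from a larger label to a smaller one, so no directed walk can return to its starting vertex; this is the reason the graph is a DAG.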

#### **2.5 Matrix Permutations and Orderings**

In sparse matrix algorithms, permutations are important transformations. A **permutation matrix** *P* is a square matrix that has exactly one entry equal to unity in each row and column, and all remaining entries are zeros (that is, it is a permutation of the identity matrix). Premultiplying a matrix by *P* reorders the rows and postmultiplying by *P* reorders the columns. *P* can be represented by an integer-valued **permutation vector** *p*, where *p*<sub>*i*</sub> is the column index of the unity within the *i*-th row of *P*. For example,

$$P = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix} \quad \text{and} \quad p = \begin{pmatrix} 2 \\ 3 \\ 1 \end{pmatrix}.$$

The graph of a matrix *A* is unchanged if a symmetric permutation *A*′ = *PAP*<sup>T</sup> is performed; only the labelling (that is, the ordering) of the vertices changes, and thus relabelling G*(A)* can be used to permute *A*. This invariance property is key in sparse matrix algorithms. As an example, consider the arrowhead matrix *A* and its graph G*(A)* given in Figure 2.8. The symmetrically permuted matrix *A*′ and G*(A*′*)* are also shown, with *P* chosen such that the first row and column of *A* are the last row and column of *A*′.

**Figure 2.8** An example of an arrowhead matrix and its undirected graph (left) and a symmetrically permuted arrowhead matrix and its undirected graph (right).
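A symmetric permutation can be applied directly from the permutation vector, without forming *P*. The sketch below uses a hypothetical 4 × 4 arrowhead matrix (not necessarily the one in Figure 2.8) and hypothetical helper names; it checks the shortcut against the explicit product *PAP*<sup>T</sup>.

```python
def sym_permute(A, p):
    """Return P A P^T using the 1-based permutation vector p of the text."""
    n = len(A)
    return [[A[p[i] - 1][p[j] - 1] for j in range(n)] for i in range(n)]

def perm_matrix(p):
    """Permutation matrix with a one in column p[i] of row i."""
    n = len(p)
    return [[1 if p[i] - 1 == j else 0 for j in range(n)] for i in range(n)]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

# Hypothetical arrowhead matrix: dense first row and column.
A = [[4, 1, 1, 1],
     [1, 3, 0, 0],
     [1, 0, 2, 0],
     [1, 0, 0, 5]]
p = [2, 3, 4, 1]          # moves the first row and column to the last position

A_perm = sym_permute(A, p)
P = perm_matrix(p)
for row in A_perm:
    print(row)
```

Entry (*i*, *j*) of the permuted matrix is entry (*p*<sub>*i*</sub>, *p*<sub>*j*</sub>) of *A*, which is why only an index lookup is needed.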

The digraph G of a general matrix *A* is not invariant under nonsymmetric permutations *PAQ*, with *Q* ≠ *P*<sup>T</sup>. A **topological ordering** of G is a labelling of its vertices such that for every edge *(i* → *j)*, vertex *i* precedes vertex *j* (i.e., *i* < *j*). It can be shown that a topological ordering is possible if and only if G has no directed cycles, that is, it is a DAG. Any DAG has at least one topological ordering. The non-uniqueness of topological orderings of a DAG is shown in Figure 2.3.
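One standard way to compute a topological ordering is Kahn's algorithm, which repeatedly removes a vertex with no incoming edges. It is not taken from the text above and is shown only as a hedged sketch; the function name and the small digraph are assumptions.

```python
from collections import deque

def topological_order(n, edges):
    """Return the vertices 1..n in a topological order, or None if a cycle exists."""
    succ = {v: [] for v in range(1, n + 1)}
    indeg = {v: 0 for v in range(1, n + 1)}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    ready = deque(v for v in succ if indeg[v] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1          # the edge (u -> v) has been accounted for
            if indeg[v] == 0:
                ready.append(v)
    return order if len(order) == n else None

# Hypothetical DAG whose current labelling is not a topological ordering.
edges = [(3, 1), (1, 2), (3, 2), (2, 4)]
order = topological_order(4, edges)
new_label = {v: k + 1 for k, v in enumerate(order)}
relabelled = [(new_label[u], new_label[v]) for u, v in edges]
print(order, relabelled)
```

After relabelling, every edge *(i* → *j)* satisfies *i* < *j*. If the input contains a directed cycle, some vertices never become ready and the function returns `None`.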

#### **2.6 Lists, Stacks and Queues**

Sparse matrix algorithms frequently require the storage and manipulation of lists. A **list** is an ordered sequence of arbitrary elements

$$(u\_0, u\_1, \dots, u\_{k-1}, u\_k),\tag{2.3}$$

where *u*<sub>0</sub> is the **head** of the list, and *u*<sub>*k*</sub> is its **tail**. An empty list is denoted by *()*.

A **stack** is a list in which elements can only be added to or removed from the head. A pointer locates the head of the stack. Let *S* = (*u*<sub>0</sub>, *u*<sub>1</sub>, ..., *u*<sub>*k*−1</sub>, *u*<sub>*k*</sub>) be a stack. *push*(*S*, *v*) denotes adding *v* onto the stack by incrementing the pointer by one, giving (*v*, *u*<sub>0</sub>, ..., *u*<sub>*k*</sub>). *pop*(*S*, *u*<sub>0</sub>) denotes the stack (*u*<sub>1</sub>, ..., *u*<sub>*k*</sub>) that results from decreasing the pointer by one (removing *u*<sub>0</sub> from the head). A **queue** is a list in which elements can be added to the tail (appended) or removed (popped) from the head. Consider the queue Q = (*u*<sub>0</sub>, *u*<sub>1</sub>, ..., *u*<sub>*k*−1</sub>, *u*<sub>*k*</sub>). The append operation *append*(Q, *u*<sub>*k*+1</sub>) results in the queue (*u*<sub>0</sub>, ..., *u*<sub>*k*</sub>, *u*<sub>*k*+1</sub>), and the pop operation *pop*(Q, *u*<sub>0</sub>) results in the queue (*u*<sub>1</sub>, ..., *u*<sub>*k*</sub>).
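Both structures can be mimicked with Python's `collections.deque`, taking the left end of the deque as the head of the list. This is only one possible representation, and the element names are placeholders.

```python
from collections import deque

# Stack: elements are added to and removed from the head (left end).
S = deque(["u0", "u1", "u2"])
S.appendleft("v")            # push(S, v)  -> (v, u0, u1, u2)
removed = S.popleft()        # pop         -> (u0, u1, u2), removes v

# Queue: elements are appended at the tail and popped from the head.
Q = deque(["u0", "u1", "u2"])
Q.append("u3")               # append(Q, u3) -> (u0, u1, u2, u3)
first = Q.popleft()          # pop(Q, u0)    -> (u1, u2, u3)

print(list(S), removed)
print(list(Q), first)
```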

### **2.7 Graph Searches**

Many sparse matrix reordering algorithms involve searching the adjacency graph G*(A)*. The sequence in which the vertices are visited can be used, for example, to reorder the graph and hence permute the matrix. Given a start vertex, a **graph search** (also called a **graph traversal**) performs a step-by-step exploration of the vertices and edges of G*(A)*, generating sets of visited vertices and explored edges. Let V*<sup>v</sup>* be the set of visited vertices and V*<sup>n</sup>* be the set of vertices that have not yet been visited. Following some chosen rule, the search step selects an unexplored edge such that one of its vertices belongs to V*v*. If the other vertex belongs to V*n*, then this vertex is moved into V*v*, and the edge is flagged as explored. The explored edge may be directed or undirected; in an undirected graph, the edge *(u, v)* formally corresponds to the pair of edges *(u* → *v)* and *(v* → *u)*.

#### *2.7.1 Breadth-First Search*

Starting from a chosen start vertex *s*, a **breadth-first search** (BFS) explores all the vertices adjacent to *s*. It then explores all the vertices whose distance from *s* is 2, and then 3, and so on (that is, sibling vertices are visited before child vertices); a queue is used in its implementation. The search terminates when there are no unexplored edges *(u, v)* with *u* ∈ V*<sup>v</sup>* and *v* ∈ V*<sup>n</sup>* that are reachable from *s*. A simple example with *s* = 1 is given in Figure 2.9. All the vertices that are at the same distance from *s* are said to belong to the same **level** of the graph. At each level, the order in which the vertices are visited is not fixed.
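A queue-based BFS that also records the level of each vertex might look as follows. The adjacency lists describe a hypothetical graph chosen to have the level structure described for Figure 2.9 (four vertices at distance 1 and three at distance 2); the actual edges of that figure are not known here.

```python
from collections import deque

def bfs_levels(adj, s):
    """Return the visiting order and the distance of each visited vertex from s."""
    level = {s: 0}
    order = [s]
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:            # v has not been visited yet
                level[v] = level[u] + 1
                order.append(v)
                queue.append(v)
    return order, level

# Hypothetical connected undirected graph (adjacency lists).
adj = {1: [2, 3, 4, 5], 2: [1, 6], 3: [1, 6, 7], 4: [1, 7],
       5: [1, 8], 6: [2, 3], 7: [3, 4], 8: [5]}

order, level = bfs_levels(adj, 1)
print(order)
print(level)
```

The order within a level depends on the order of the adjacency lists, which reflects the remark that the visiting order at each level is not fixed.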

#### *2.7.2 Depth-First Search*

A **depth-first search** (DFS) of a graph G visits child vertices before visiting sibling vertices; that is, it traverses the depth of a path before exploring its breadth. Starting from a chosen vertex *s*, the vertices that are visited are those vertices *u* for which a directed path from *s* to *u* exists in G. This will give different results depending on *s* and how ties are broken. In the example given in Figure 2.10, the search works from left to right. As with the BFS, all vertices in R*each(s)* are visited. The edges that are traversed form a DFS spanning tree. In general, visiting all the edges of a graph results in a DFS forest that consists of exactly one DFS spanning tree for each connected component of the original graph. Thus the DFS can be used to compute connected components (see Algorithm 3.6).

**Figure 2.9** An illustration of a BFS of a connected undirected graph, with the labels indicating the order in which the vertices are visited. Vertices 2*,* 3*,* 4*,* 5 are all at distance 1 from *s* and so belong to the first level; vertices 6*,* 7*,* 8 belong to the second level.

**Figure 2.10** An illustration of a DFS of a connected directed graph. The labels indicate the order in which the vertices are visited. The edges of the DFS spanning tree are in bold.

There are a number of ways to construct the output vertex order for a DFS. In a **preorder** list, the vertices are returned in the order in which they are added into V*v*, while in a **postorder** list, the vertices are in the order in which they are last visited during the DFS algorithm (note that the reverse of a postordering is not the same as preordering). For the example in Figure 2.10, the vertices are added into V*<sup>v</sup>* in the order 1*,* 2*,* 3*,* 4*,* 5*,* 6*,* 7, and this is the preorder list. The sequence in which the DFS visits the vertices is 1*,* 2*,* 3*,* 2*,* 4*,* 2*,* 1*,* 5*,* 6*,* 5*,* 1*,* 7*,* 1. In this sequence, vertex 3 is the first vertex to appear for the last time so the postordering starts with vertex 3. The next vertex to appear for the last time is vertex 4, followed by vertex 2, and so on, resulting in the postorder list 3*,* 4*,* 2*,* 6*,* 5*,* 7*,* 1.

Algorithm 2.1 presents a DFS and outputs both the preorder and postorder lists. The call **dfs\_step** is made exactly once for each vertex *v*. Observe that if there is a path from vertex *v* to vertex *w* in the search tree, then *v* is labelled ahead of *w* in the preorder list and *w* is labelled ahead of *v* in the postorder list.

#### **ALGORITHM 2.1 Find preorder and postorder lists using a DFS**

**Input:** Directed graph G = *(*V*,* E*)*. **Output:** Preorder list *preorder* and postorder list *postorder*.

```
Vv = ∅, preorder = () and postorder = ()
for all v ∈ V do
    if v ∉ Vv then
        push(preorder, v)          ▷ Add v onto the preorder stack
        Vv = Vv ∪ {v}              ▷ Add v to the set of visited vertices
        dfs_step(v)
    end if
end for

recursive function dfs_step(v)
    for all (v → w) ∈ E do
        if w ∉ Vv then
            push(preorder, w)      ▷ Add w onto the preorder stack
            Vv = Vv ∪ {w}          ▷ Add w to the set of visited vertices
            dfs_step(w)            ▷ Recursive search
        end if
    end for
    push(postorder, v)             ▷ Add v onto the postorder stack
end recursive function
```
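A direct Python transcription of Algorithm 2.1 is sketched below. Two choices are assumptions of this sketch rather than part of the algorithm: vertices are appended to the tail of ordinary Python lists, so that the lists read in visiting order, and the digraph is a hypothetical one constructed to reproduce the preorder and postorder lists quoted for Figure 2.10.

```python
def dfs_orders(vertices, succ):
    """Return the preorder and postorder lists of a DFS of a digraph."""
    visited = set()
    preorder, postorder = [], []

    def dfs_step(v):
        for w in succ.get(v, []):
            if w not in visited:
                preorder.append(w)     # w is visited for the first time
                visited.add(w)
                dfs_step(w)            # recursive search
        postorder.append(v)            # v is visited for the last time

    for v in vertices:
        if v not in visited:
            preorder.append(v)
            visited.add(v)
            dfs_step(v)
    return preorder, postorder

# Hypothetical digraph: succ[v] lists the heads of the edges leaving v.
succ = {1: [2, 5, 7], 2: [3, 4], 4: [3], 5: [6], 6: [1]}
preorder, postorder = dfs_orders(range(1, 8), succ)
print(preorder)
print(postorder)
```

For large graphs the recursion depth can exceed Python's default limit, so an explicit stack would normally replace the recursive call.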
#### **2.8 Notes and References**

Graph theory has become an important mathematical tool in a wide variety of subjects, as well as being a mathematical discipline in its own right. There are many introductory textbooks. For example, the first four chapters of Wilson (1996) provide a basic foundation course, including definitions and examples of graphs, and the graduate-level textbook Bondy & Murty (2008) presents a coherent introduction to graph theory. The introductions to graphs given in computer science monographs such as Cormen et al. (2009) and Skiena (2020) are also ideal for our purposes.

Many papers that present sparse matrix algorithms employ graph concepts. Significant contributions include Parter (1961), Rose (1973), Rose et al. (1976), and Rose & Tarjan (1978). Important ideas first appeared in the published proceedings of some of the early conferences that focussed on sparse matrix computations, including Reid (1971), Rose & Willoughby (1972), Duff (1981), and Evans (1985). Much of the fundamental work from the 1960s and 1970s is given in the book by Tewarson (1973) and summarized later by Pissanetzky (1984). The general texts on sparse factorizations by George & Liu (1981), Davis (2006), and Duff et al. (2017) provide further sources of references and examples; see also Kepner & Gilbert (2011).

Discussions of data structures and graph searches can be found in Aho et al. (1983) and Tarjan (1983). The systematic analysis of the depth-first search algorithm is given in Tarjan (1972), but backtracking techniques on which this search is based were used even earlier in artificial intelligence and combinatorial optimization.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 3 Introduction to Matrix Factorizations**

*If numerical analysts understand anything, surely it must be Gaussian elimination. This is the oldest and truest of numerical algorithms . . . This algorithm has been so successful that to many of us, Gaussian elimination and Ax* = *b are more or less synonymous. – Trefethen (1985).*

*Gaussian elimination is the standard method for solving a system of linear equations. As such, it is one of the most ubiquitous numerical algorithms and plays a fundamental role in scientific computation. – Higham (2011)*

This chapter introduces the basic concepts of Gaussian elimination and its formulation as a matrix factorization that can be expressed in a number of mathematically equivalent but algorithmically different ways.

Using unweighted graphs to capture the sparsity structures of matrices during Gaussian elimination is simplified by assuming that the result of adding, subtracting, or multiplying two nonzeros is nonzero. It follows that if *A* = *LU* and E<sub>*L*</sub> denotes the set of (directed) edges of the digraph G*(L)*, then for *i* > *j*

$$a\_{ij} \neq 0 \quad \text{implies} \quad (i \to j) \in \mathcal{E}\_L.$$

This is the **non-cancellation assumption**. It allows the following observation.

**Observation 3.1** *The sparsity structures of the LU factors of A satisfy*

$$\mathcal{S}\{A\} \subseteq \mathcal{S}\{L+U\}.$$

*That is, the factors may contain entries that lie outside the sparsity structure of A. Such entries are termed filled entries, and together the filled entries are called the fill-in. The graph obtained from* G*(A) by adding the fill-in is called the filled graph.*

Numerical cancellations in LU factorizations rarely happen, and in general, they are difficult to predict, particularly in floating-point arithmetic. Thus, such accidental zeros are not normally exploited in implementations, and we will ignore the possibility of their occurrence.
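Under the non-cancellation assumption, the sparsity structure of *L* + *U* can be predicted from the pattern of *A* alone. The sketch below performs such a symbolic elimination on two hypothetical 4 × 4 arrowhead patterns; the function name and the set-of-index-pairs representation are assumptions of the sketch.

```python
def symbolic_fill(pattern, n):
    """Pattern of L + U for a pattern of A, assuming no cancellation."""
    filled = set(pattern)
    for k in range(1, n + 1):
        rows = [i for i in range(k + 1, n + 1) if (i, k) in filled]
        cols = [j for j in range(k + 1, n + 1) if (k, j) in filled]
        for i in rows:
            for j in cols:
                filled.add((i, j))     # nonzero (i,k) and (k,j) create (i,j)
    return filled

n = 4
diag = {(i, i) for i in range(1, n + 1)}
# Hypothetical arrowhead pattern with a dense first row and column.
arrow_first = diag | {(1, j) for j in range(2, n + 1)} | {(i, 1) for i in range(2, n + 1)}
# The same pattern symmetrically permuted so the dense row and column come last.
arrow_last = diag | {(n, j) for j in range(1, n)} | {(i, n) for i in range(1, n)}

fill_first = symbolic_fill(arrow_first, n) - arrow_first
fill_last = symbolic_fill(arrow_last, n) - arrow_last
print(sorted(fill_first))
print(sorted(fill_last))
```

With the dense row and column first, every remaining off-diagonal position becomes a filled entry; with them last, there is no fill-in at all. In both cases the pattern of *A* is contained in the pattern of *L* + *U*, as stated in Observation 3.1.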

#### **3.1 Gaussian Elimination: An Overview**

The traditional way of describing Gaussian elimination is based on the systematic column-by-column annihilation of the entries in the lower triangular part of *A*. Assuming *A* is factorizable, this can be written formally as sequential multiplications by **column elimination matrices** that yield the **elimination sequence**

$$A = A^{(1)}, A^{(2)}, \dots, A^{(n)}\tag{3.1}$$

of partially eliminated matrices as follows:

$$A^{(1)} \to A^{(2)} = C\_1 A^{(1)} \to A^{(3)} = C\_2 C\_1 A^{(1)} \to \dots \to A^{(n)} = C\_{n-1} \dots C\_2 C\_1 A^{(1)}.$$

The unit lower triangular matrices *C*<sub>*i*</sub> (1 ≤ *i* ≤ *n* − 1) are the column elimination matrices. Elementwise, assuming *a*<sub>11</sub> = *a*<sup>(1)</sup><sub>11</sub> ≠ 0, the first step *C*<sub>1</sub>*A*<sup>(1)</sup> = *A*<sup>(2)</sup> is

$$
\begin{pmatrix} 1 \\ -a\_{21}^{(1)}/a\_{11}^{(1)} & 1 \\ -a\_{31}^{(1)}/a\_{11}^{(1)} & & 1 \\ \vdots & & & \ddots \\ -a\_{n1}^{(1)}/a\_{11}^{(1)} & & & & 1 \end{pmatrix} \begin{pmatrix} a\_{11}^{(1)} & a\_{12}^{(1)} & \dots & a\_{1n}^{(1)} \\ a\_{21}^{(1)} & a\_{22}^{(1)} & \dots & a\_{2n}^{(1)} \\ a\_{31}^{(1)} & a\_{32}^{(1)} & \dots & a\_{3n}^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ a\_{n1}^{(1)} & a\_{n2}^{(1)} & \dots & a\_{nn}^{(1)} \end{pmatrix} = \begin{pmatrix} a\_{11}^{(1)} & a\_{12}^{(1)} & \dots & a\_{1n}^{(1)} \\ 0 & a\_{22}^{(2)} & \dots & a\_{2n}^{(2)} \\ 0 & a\_{32}^{(2)} & \dots & a\_{3n}^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & a\_{n2}^{(2)} & \dots & a\_{nn}^{(2)} \end{pmatrix},
$$

and provided *a*<sup>(2)</sup><sub>22</sub> ≠ 0, the second step *C*<sub>2</sub>*A*<sup>(2)</sup> = *A*<sup>(3)</sup> is

$$
\begin{pmatrix} 1 \\ & 1 \\ & -a\_{32}^{(2)}/a\_{22}^{(2)} & 1 \\ & \vdots & & \ddots \\ & -a\_{n2}^{(2)}/a\_{22}^{(2)} & & & 1 \end{pmatrix} \begin{pmatrix} a\_{11}^{(1)} & a\_{12}^{(1)} & \dots & a\_{1n}^{(1)} \\ 0 & a\_{22}^{(2)} & \dots & a\_{2n}^{(2)} \\ 0 & a\_{32}^{(2)} & \dots & a\_{3n}^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & a\_{n2}^{(2)} & \dots & a\_{nn}^{(2)} \end{pmatrix} = \begin{pmatrix} a\_{11}^{(1)} & a\_{12}^{(1)} & \dots & \dots & a\_{1n}^{(1)} \\ 0 & a\_{22}^{(2)} & \dots & \dots & a\_{2n}^{(2)} \\ 0 & 0 & a\_{33}^{(3)} & \dots & a\_{3n}^{(3)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & a\_{n3}^{(3)} & \dots & a\_{nn}^{(3)} \end{pmatrix}.
$$

The *k*-th **partially eliminated matrix** is *A*<sup>(*k*)</sup>. The **active entries** in *A*<sup>(*k*)</sup> are denoted by *a*<sup>(*k*)</sup><sub>*ij*</sub>, 1 ≤ *k* ≤ *i*, *j* ≤ *n* (in the sparse case, many of the entries are zero), and the (*n* − *k* + 1) × (*n* − *k* + 1) submatrix of *A*<sup>(*k*)</sup> containing the active entries is termed its **active submatrix**. The graph associated with the active submatrix is the *k*-th **elimination graph** and is denoted by G<sup>*k*</sup>. If S{*A*} is nonsymmetric, then G<sup>*k*</sup> is a digraph.

The inverse of each *C*<sub>*k*</sub> is the unit lower triangular matrix that is obtained by changing the sign of all the off-diagonal entries, and because the product of unit lower triangular matrices is a unit lower triangular matrix, it is clear that provided *a*<sup>(*k*)</sup><sub>*kk*</sub> ≠ 0 (1 ≤ *k* < *n*)

$$A = A^{(1)} = C\_1^{-1} C\_2^{-1} \dots C\_{n-1}^{-1} A^{(n)} = LU,$$

where the unit lower triangular matrix *L* is the product *C*<sub>1</sub><sup>−1</sup>*C*<sub>2</sub><sup>−1</sup> ... *C*<sub>*n*−1</sub><sup>−1</sup> and *U* = *A*<sup>(*n*)</sup> is an upper triangular matrix. The subdiagonal entries of *L* are the negative of the subdiagonal entries of the matrix *C*<sub>1</sub> + *C*<sub>2</sub> + ... + *C*<sub>*n*−1</sub>. If *A* is a symmetric positive definite (SPD) matrix, then setting *U* = *DL*<sup>T</sup>, where *D* is the diagonal matrix holding the diagonal entries of *U*, the LU factorization can be written as

$$A = LDL^T,$$

which is the square root-free Cholesky factorization. Alternatively, it can be expressed as the Cholesky factorization

$$A = (LD^{1/2})(LD^{1/2})^T,$$

where the lower triangular matrix *LD*<sup>1/2</sup> has positive diagonal entries.
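The relationship between the LU, LDL<sup>T</sup>, and Cholesky factorizations can be checked numerically on a small example. The sketch below uses a hypothetical 3 × 3 SPD matrix and a plain elimination routine without pivoting (the name `lu_nopivot` is an assumption); exact rational arithmetic is used for the LU step so that the identity *U* = *DL*<sup>T</sup> can be tested exactly.

```python
from fractions import Fraction
import math

def lu_nopivot(A):
    """Gaussian elimination without pivoting; A is assumed to be factorizable."""
    n = len(A)
    U = [[Fraction(x) for x in row] for row in A]
    L = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]          # multiplier
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U

# Hypothetical SPD matrix.
A = [[4, 2, 2],
     [2, 5, 3],
     [2, 3, 6]]
n = len(A)
L, U = lu_nopivot(A)
d = [U[i][i] for i in range(n)]                  # diagonal of D

# Entry (i, j) of D L^T is d_i * l_ji.
DLt = [[d[i] * L[j][i] for j in range(n)] for i in range(n)]

# Cholesky factor L D^{1/2}, formed in floating point.
G = [[float(L[i][j]) * math.sqrt(d[j]) for j in range(n)] for i in range(n)]
GGt = [[sum(G[i][k] * G[j][k] for k in range(n)) for j in range(n)]
       for i in range(n)]

print(d)
print(G)
```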

The process of performing an LU factorization can be rewritten in the **generic form** given in Algorithm 3.1. Here each *l*<sub>*ik*</sub> is called a **multiplier**, and the *a*<sup>(*k*)</sup><sub>*kk*</sub> are called **pivots**. The assumption that *A* is factorizable implies *a*<sup>(*k*)</sup><sub>*kk*</sub> ≠ 0 for all *k*. Algorithm 3.1 comprises three nested loops. There are six ways of assigning the indices to the loops, with the loops having different ranges. The performance of the variants can differ significantly depending on the computer architecture. The key difference is the way the data are accessed from the factorized part of the matrix and applied to the part that is not yet factorized. But in exact arithmetic, they result in the same *L* and *U*, which allows any of them to be used to demonstrate theoretical properties of LU factorizations. To identify the variants, names that derive from the order in which the indices are assigned to the loops can be used. The *kij* and *kji* variants are called **submatrix LU factorizations**. The schemes *jik* and *jki* compute the factors by columns and are called **column factorizations**. The final two are **row factorizations** because they proceed by rows. A row factorization can be considered as a column LU factorization applied to *A*<sup>T</sup>.

#### **ALGORITHM 3.1 Generic LU factorization**

**Input:** Factorizable matrix *A*. **Output:** LU factorization *A* = *LU*.

```
for ————– do
    for ————– do
        for ————– do
            l_ik = a_ik^(k) / a_kk^(k)
            a_ij^(k+1) = a_ij^(k) − l_ik a_kj^(k)
        end for
    end for
end for
```
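To make the loop orderings concrete, the following sketch implements one submatrix variant (*kij*) and one column variant (*jki*) and checks that they return the same factors. The function names and the test matrix are hypothetical, no pivoting is performed, and each multiplier is computed once outside the innermost loop rather than inside it as the generic form suggests.

```python
from fractions import Fraction

def _copy(A):
    return [[Fraction(x) for x in row] for row in A]

def _identity(n):
    return [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]

def _upper(W):
    n = len(W)
    return [[W[i][j] if j >= i else Fraction(0) for j in range(n)]
            for i in range(n)]

def lu_kij(A):
    """Submatrix variant: k is the outermost index."""
    n = len(A)
    W, L = _copy(A), _identity(n)
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i][k] = W[i][k] / W[k][k]          # multiplier l_ik
            for j in range(k + 1, n):
                W[i][j] -= L[i][k] * W[k][j]     # update the active submatrix
    return L, _upper(W)

def lu_jki(A):
    """Column variant: j is the outermost index."""
    n = len(A)
    W, L = _copy(A), _identity(n)
    for j in range(n):
        for k in range(j):
            for i in range(k + 1, n):
                W[i][j] -= L[i][k] * W[k][j]     # apply column k of L to column j
        for i in range(j + 1, n):
            L[i][j] = W[i][j] / W[j][j]          # multipliers for column j
    return L, _upper(W)

# Hypothetical factorizable matrix.
A = [[2, 1, 1],
     [4, 3, 3],
     [8, 7, 9]]
L, U = lu_kij(A)
print(L)
print(U)
```

The *kij* loop touches the whole active submatrix at every step, whereas the *jki* loop leaves column *j* untouched until all earlier columns of *L* are available; this difference in data access is what distinguishes the variants in practice.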

#### *3.1.1 Submatrix LU Factorizations*

Each outermost step of the submatrix LU variants computes one row of *U* and one column of *L*. The first step (*k* = 1) is 

$$C\_1 A = \begin{pmatrix} 1 \\ -A\_{2:n,1}/a\_{11} & I \end{pmatrix} \begin{pmatrix} a\_{11} & A\_{1,2:n} \\ A\_{2:n,1} & A\_{2:n,2:n} \end{pmatrix} = \begin{pmatrix} a\_{11} & A\_{1,2:n} \\ & S \end{pmatrix},$$

where the *(n* − 1*)* × *(n* − 1*)* active submatrix

$$S = A\_{2:n,2:n} - A\_{2:n,1} A\_{1,2:n}/a\_{11} = A\_{2:n,2:n} - L\_{2:n,1} U\_{1,2:n}$$

is the **Schur complement** of *A* with respect to *a*11. If *A* is factorizable, then so too is *S* and the process can be repeated.
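The first step and its repetition can be sketched as follows. The helper name `schur_step` and the matrix are hypothetical, and every pivot is assumed to be nonzero.

```python
from fractions import Fraction

def schur_step(A):
    """Schur complement of A with respect to its (1,1) entry (assumed nonzero)."""
    n = len(A)
    a11 = Fraction(A[0][0])
    return [[A[i][j] - A[i][0] * A[0][j] / a11 for j in range(1, n)]
            for i in range(1, n)]

# Hypothetical factorizable matrix.
A = [[2, 1, 1],
     [4, 3, 3],
     [8, 7, 9]]

pivots = []
S = A
while S:                      # repeat until the active submatrix is empty
    pivots.append(S[0][0])
    S = schur_step(S)

print(schur_step(A))
print(pivots)
```

Each call is a rank-one update of the trailing block, and the (1,1) entries of the successive Schur complements are the pivots, that is, the diagonal entries of *U*.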

More generally, the operations performed at each step *k* correspond to a sequence of rank-one updates. The resulting Schur complement can be written in terms of entries of the matrices from the elimination sequence and entries of the computed factors. After *k* − 1 steps (1 < *k* ≤ *n*), the (*n* − *k* + 1) × (*n* − *k* + 1) Schur complement of *A* with respect to its (*k* − 1) × (*k* − 1) leading principal submatrix is the active submatrix of the partially eliminated matrix *A*<sup>(*k*)</sup> given by

$$S^{(k)} = \begin{pmatrix} a\_{kk} & \dots & a\_{kn} \\ \vdots & \ddots & \vdots \\ a\_{nk} & \dots & a\_{nn} \end{pmatrix} - \sum\_{j=1}^{k-1} \begin{pmatrix} l\_{kj} \\ \vdots \\ l\_{nj} \end{pmatrix} \begin{pmatrix} u\_{jk} & \dots & u\_{jn} \end{pmatrix}$$

$$= A\_{k:n,k:n} - \sum\_{j=1}^{k-1} L\_{k:n,j} U\_{j,k:n}$$

$$= \begin{pmatrix} a\_{kk}^{(k)} & \dots & a\_{kn}^{(k)} \\ \vdots & \ddots & \vdots \\ a\_{nk}^{(k)} & \dots & a\_{nn}^{(k)} \end{pmatrix} = A\_{k:n,k:n}^{(k)}.\tag{3.2}$$

If *A* is SPD, then the Cholesky and LDL<sup>T</sup> factorizations that are special cases of the submatrix approach are termed **right-looking** (fan-out) factorizations.

#### *3.1.2 Column LU Factorizations*

In the column LU factorization, the outermost index in Algorithm 3.1 is *j* . For *j* = 1, *l*<sup>11</sup> = 1, and the off-diagonal entries in column 1 of *L* are obtained by dividing the corresponding entries in column 1 of *A* by *u*<sup>11</sup> = *a*11. Assume *j* − 1 columns (1 *< j* ≤ *n*) of *L* and *U* have been computed. The partial column factorization can be expressed as 

$$
\begin{pmatrix} L\_{1:j-1,1:j-1} \\ L\_{j:n,1:j-1} \end{pmatrix} U\_{1:j-1,1:j-1} = \begin{pmatrix} A\_{1:j-1,1:j-1} \\ A\_{j:n,1:j-1} \end{pmatrix}.
$$

Column *j* of *U* and then column *j* of *L* are computed using the identities

$$U\_{1:j-1,j} = L\_{1:j-1,1:j-1}^{-1} A\_{1:j-1,j}, \quad u\_{jj} = a\_{jj} - L\_{j,1:j-1} U\_{1:j-1,j},$$

and

$$l\_{jj} = 1, \quad L\_{j+1:n,j} = (A\_{j+1:n,j} - L\_{j+1:n,1:j-1}U\_{1:j-1,j})/u\_{jj}.$$

Thus the strictly upper triangular part of column *j* of *U* is determined by solving the triangular system

$$L\_{1:j-1,1:j-1}U\_{1:j-1,j} = A\_{1:j-1,j},$$

and the strictly lower triangular part of column *j* of *L* is computed as a linear combination of column *A*<sub>*j*+1:*n*,*j*</sub> of *A* and previously computed columns of *L*.

If *A* is symmetric and the pivots can be used in the order 1, 2, ... without modification, then there is the following link between its column LU and LDL<sup>T</sup> factorizations.

**Observation 3.2** *The j-th diagonal entry d<sub>jj</sub> (1 ≤ j ≤ n) of the LDL<sup>T</sup> factorization of the symmetric matrix A is*

$$d\_{jj} = u\_{jj} = a\_{jj} - \sum\_{k=1}^{j-1} d\_{kk} l\_{jk}^2.$$

*The L factor is the same as is computed by the column LU factorization; its computation can be written as*

$$d\_{jj} L\_{j+1:n,j} = A\_{j+1:n,j} - \sum\_{k=1}^{j-1} L\_{j+1:n,k} \, d\_{kk} \, l\_{jk}.$$

*The U factor is equal to DL<sup>T</sup>. Computing L and D in this way is called the left-looking (fan-in) factorization.*

#### **ALGORITHM 3.2 Basic column LU factorization with partial pivoting**

**Input:** Nonsingular nonsymmetric matrix *A*. **Output:** LU factorization *PA* = *LU*, where *P* is a row permutation matrix.

1: Interchange rows of *A* so that |*a*<sub>11</sub>| = max{|*a*<sub>*i*1</sub>| | 1 ≤ *i* ≤ *n*}

2: *l*<sub>11</sub> = 1, *u*<sub>11</sub> = *a*<sub>11</sub>, *L*<sub>2:*n*,1</sub> = *A*<sub>2:*n*,1</sub>/*a*<sub>11</sub>

3: **for** *j* = 2 : *n* **do**

4: Solve *L*<sub>1:*j*−1,1:*j*−1</sub> *U*<sub>1:*j*−1,*j*</sub> = *A*<sub>1:*j*−1,*j*</sub>

8: **end for**

So far, we have assumed that *A* is factorizable. If *A* is nonsingular, then there exists a row permutation matrix *P* such that *PA* is factorizable (Theorem 1.1), and if there are zeros on the diagonal, then the rows can always be permuted to achieve a nonzero diagonal. Consider the simple 2 × 2 matrix *A* and its LU factorization

$$A = \begin{pmatrix} \delta & 1\\ 1 & 1 \end{pmatrix} = \begin{pmatrix} 1\\ \delta^{-1} & 1 \end{pmatrix} \begin{pmatrix} \delta & 1\\ & 1 - \delta^{-1} \end{pmatrix}.$$

If *δ* = 0, this factorization does not exist, and if *δ* is very small, then the entries in the factors involving *δ*<sup>−1</sup> are very large. But interchanging the rows of *A*, we have

$$PA = \begin{pmatrix} 1 & 1 \\ \delta & 1 \end{pmatrix} = \begin{pmatrix} 1 & \\ \delta & 1 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ & 1 - \delta \end{pmatrix},$$

which is valid for all *δ* ≠ 1. Algorithm 3.2 presents a basic column LU factorization scheme for nonsingular *A*. The interchanging of rows at each elimination step to select the entry of largest absolute value in its column as the next pivot is called **partial pivoting**. It avoids small pivots and results in an LU factorization of a row permuted matrix *PA* in which the absolute value of each entry of *L* is at most 1. In practice, partial pivoting (or another pivoting strategy) is incorporated into all LU factorization variants. Pivoting strategies are discussed in Chapter 7.
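A column-oriented LU factorization with partial pivoting can be sketched as below. This is an illustration assembled from the column identities of Section 3.1.2 and the description of partial pivoting, not a transcription of Algorithm 3.2; the function name, the temporary vector `z`, and the convention that row *i* of *PA* is row `p[i]` of *A* (0-based) are all assumptions of the sketch.

```python
from fractions import Fraction

def column_lu_pp(A):
    """Return (p, L, U) with P A = L U, where row i of P A is row p[i] of A."""
    n = len(A)
    W = [[Fraction(x) for x in row] for row in A]   # rows are interchanged in W
    L = [[Fraction(0)] * n for _ in range(n)]
    U = [[Fraction(0)] * n for _ in range(n)]
    p = list(range(n))
    for j in range(n):
        # Forward substitution for the strictly upper part of column j of U.
        for i in range(j):
            U[i][j] = W[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        # Rows j..n-1 of column j after applying the computed columns of L.
        z = [W[i][j] - sum(L[i][k] * U[k][j] for k in range(j))
             for i in range(j, n)]
        # Partial pivoting: bring the entry of largest absolute value to row j.
        r = j + max(range(n - j), key=lambda t: abs(z[t]))
        if r != j:
            W[j], W[r] = W[r], W[j]
            L[j], L[r] = L[r], L[j]
            p[j], p[r] = p[r], p[j]
            z[0], z[r - j] = z[r - j], z[0]
        L[j][j] = Fraction(1)
        U[j][j] = z[0]                               # pivot (nonzero if A is nonsingular)
        for i in range(j + 1, n):
            L[i][j] = z[i - j] / z[0]
    return p, L, U

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

delta = Fraction(1, 1000)
A2 = [[delta, 1], [1, 1]]                  # the 2 x 2 example with a small delta
p2, L2, U2 = column_lu_pp(A2)

A3 = [[2, 1, 1], [4, 3, 3], [8, 7, 9]]     # hypothetical nonsingular matrix
p3, L3, U3 = column_lu_pp(A3)
print(p2, L2, U2)
print(p3)
```

For the 2 × 2 example the rows are interchanged, and the factors agree with the factorization of *PA* displayed above: the multiplier is *δ* and the last pivot is 1 − *δ*.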

#### *3.1.3 Factorizations by Bordering*

The generic LU factorization scheme does not cover all possible approaches. An alternative is **factorization by bordering**. Set all diagonal entries of *L* to 1, and assume the first *k* − 1 rows of *L* and first *k* − 1 columns of *U* (1 < *k* ≤ *n*) have been computed (that is, *L*<sub>1:*k*−1,1:*k*−1</sub> and *U*<sub>1:*k*−1,1:*k*−1</sub>). At step *k*, the factors must satisfy

$$A\_{1:k,1:k} = \begin{pmatrix} A\_{1:k-1,1:k-1} & A\_{1:k-1,k} \\ A\_{k,1:k-1} & a\_{kk} \end{pmatrix} = \begin{pmatrix} L\_{1:k-1,1:k-1} & 0 \\ L\_{k,1:k-1} & 1 \end{pmatrix} \begin{pmatrix} U\_{1:k-1,1:k-1} & U\_{1:k-1,k} \\ 0 & u\_{kk} \end{pmatrix}.$$

Equating terms, the lower triangular part of row *k* of *L* and the upper triangular part of column *k* of *U* are obtained by solving

$$\begin{aligned} L\_{k,1:k-1} U\_{1:k-1,1:k-1} &= A\_{k,1:k-1}, \\ L\_{1:k-1,1:k-1} U\_{1:k-1,k} &= A\_{1:k-1,k}. \end{aligned}$$

The diagonal entry *ukk* is then given by

$$
u\_{kk} = a\_{kk} - L\_{k, 1:k-1} U\_{1:k-1, k} \quad \text{(with } u\_{11} = a\_{11}).
$$
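The bordering scheme translates almost line by line into code. The sketch below is a dense, unpivoted illustration in plain Python; it assumes the matrix is factorizable, and the name `lu_bordering` and the list-of-rows storage are choices made here, not taken from the text.

```python
def lu_bordering(a):
    """LU factorization by bordering (no pivoting); assumes `a` is factorizable."""
    n = len(a)
    l = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    u = [[0.0] * n for _ in range(n)]
    for k in range(n):
        # Column k of U: forward substitution with the leading block of L.
        for i in range(k):
            u[i][k] = a[i][k] - sum(l[i][m] * u[m][k] for m in range(i))
        # Row k of L: solve L[k, :k] U[:k, :k] = A[k, :k] one column at a time.
        for j in range(k):
            s = a[k][j] - sum(l[k][m] * u[m][j] for m in range(j))
            l[k][j] = s / u[j][j]
        # Diagonal entry u_kk = a_kk - L[k, :k] U[:k, k].
        u[k][k] = a[k][k] - sum(l[k][m] * u[m][k] for m in range(k))
    return l, u


l, u = lu_bordering([[2.0, 1.0, 1.0],
                     [4.0, 3.0, 3.0],
                     [8.0, 7.0, 9.0]])
```

At step *k* only the leading (*k* − 1) × (*k* − 1) blocks of the factors are read, which is the defining feature of the bordering variant.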

#### **3.2 Fill-in in Sparse Gaussian Elimination**

Here we give some simple results that describe fill-in in the matrix factors; strategies to limit fill-in will be presented in Chapter 8. We start by looking at the rules that establish the positions of the entries in the factors. Assume S{*A*} is symmetric, and consider the elimination graph G<sup>*k*</sup> at step *k*. Its vertices are the *n* − *k* + 1 uneliminated vertices. Its edge set contains the edges in G(*A*) connecting these vertices and additional edges corresponding to filled entries produced during the first *k* − 1 elimination steps. The sequence of graphs G<sup>1</sup> ≡ G(*A*), G<sup>2</sup>, ... is generated recursively using **Parter's rule**:

*To obtain the elimination graph* G<sup>*k*+1</sup> *from* G<sup>*k*</sup>, *delete vertex k and add all possible edges between vertices that are adjacent to vertex k in* G<sup>*k*</sup>.

Denoting G<sup>*k*</sup> = (V<sup>*k*</sup>, E<sup>*k*</sup>) and G<sup>*k*+1</sup> = (V<sup>*k*+1</sup>, E<sup>*k*+1</sup>), this can be written as

$$\mathcal{V}^{k+1} = \mathcal{V}^k \backslash \{k\}, \; \mathcal{E}^{k+1} = \mathcal{E}^k \cup \{ (i, j) \mid i, j \in adj\_{\mathcal{G}^k} \{k\} \} \backslash \{ (i, k) \mid i \in adj\_{\mathcal{G}^k} \{k\} \}.$$

If S{*A*} is nonsymmetric, then the elimination graphs are digraphs and Parter's rule generalizes as follows:

*To obtain the elimination graph* G<sup>*k*+1</sup> *from* G<sup>*k*</sup>, *delete vertex k and add all edges (i → j) to* G<sup>*k*+1</sup> *such that (i → k) and (k → j) are edges of* G<sup>*k*</sup>.

**Figure 3.1** Illustration of Parter's rule. The original undirected graph G = G<sup>1</sup> and the elimination graph G<sup>2</sup> that results from eliminating vertex 1 are shown on the left and right, respectively. The red dashed lines denote fill edges. The vertices {2, 3, 4} become a clique.

**Figure 3.2** Illustration of Parter's rule for a nonsymmetric S{*A*}. The original digraph G = G<sup>1</sup> and the directed elimination graph G<sup>2</sup> that results from eliminating vertex 1 are shown on the left and right, respectively. The red dashed lines denote fill edges.

Simple examples are given in Figures 3.1 and 3.2.

In terms of graph theory, if S{*A*} is symmetric, then Parter's rule says that the adjacency set of vertex *k* becomes a clique when *k* is eliminated. Thus, Gaussian elimination systematically generates cliques. As the elimination process progresses, cliques grow, or two or more cliques join to form larger cliques, a process known as **clique amalgamation**. A clique with *m* vertices has *m*(*m* − 1)/2 edges, but it can be represented by storing a list of its vertices, without any reference to edges. This enables important savings in both storage and data movement to be achieved during the symbolic phase of a direct solver.
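Parter's rule for symmetric S{*A*} can be simulated directly on adjacency sets. The following plain Python sketch is an illustration only: the helper names `eliminate_vertex` and `fill_edges` and the two small star graphs are assumptions made here and are not taken from the figures.

```python
def eliminate_vertex(adj, k):
    """Apply Parter's rule in place: delete k and make its neighbours a clique.

    Returns the set of fill edges (i, j), i < j, created by this step.
    """
    nbrs = adj.pop(k)
    for i in nbrs:
        adj[i].discard(k)
    fill = set()
    for i in nbrs:
        for j in nbrs:
            if i < j and j not in adj[i]:
                adj[i].add(j)
                adj[j].add(i)
                fill.add((i, j))
    return fill


def fill_edges(n, edges):
    """All fill edges when vertices 1..n are eliminated in the natural order."""
    adj = {v: set() for v in range(1, n + 1)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    fill = set()
    for k in range(1, n + 1):
        fill |= eliminate_vertex(adj, k)
    return fill


# A star graph whose centre is eliminated first: the three leaves become a clique.
centre_first = fill_edges(4, [(1, 2), (1, 3), (1, 4)])
# The same star with the centre numbered last: no fill at all.
centre_last = fill_edges(4, [(1, 4), (2, 4), (3, 4)])
```

The two calls differ only in the numbering of the vertices, which hints at why the elimination order matters so much for fill-in.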

The repeated application of Parter's rule specifies all the edges in G(*L* + *L<sup>T</sup>*):

*(i, j) is an edge of* G(*L* + *L<sup>T</sup>*) *if and only if (i, j) is an edge of* G(*A*) *or (i, k) and (k, j) are edges of* G(*L* + *L<sup>T</sup>*) *for some k < i, j.*

This generalizes to a nonsymmetric matrix *A* and its LU factorization:

*(i → j) is an edge of the digraph* G(*L* + *U*) *if and only if (i → j) is an edge of the digraph* G(*A*) *or (i → k) and (k → j) are edges of* G(*L* + *U*) *for some k < i, j.*

**Figure 3.3** Example to illustrate fill-in during the factorization of a symmetric matrix, with the eliminations performed in the natural order. S{*A*} and S{*L* + *L<sup>T</sup>*} are on the left and right, respectively, with the corresponding undirected graphs G(*A*) and G(*L* + *L<sup>T</sup>*). Filled entries in *L* + *L<sup>T</sup>* are denoted by *f*. The red dashed lines in the filled graph G(*L* + *L<sup>T</sup>*) correspond to filled entries.

Parter's rule is a local rule that uses the dependency on nonzeros obtained in previous steps of the factorization. The following result, which uses the path notation of Section 2.2, fully characterizes the nonzero entries in the factors using only paths in G*(A)*.

#### **Theorem 3.1 (Rose et al. 1976; Rose & Tarjan 1978)**


*The fill-paths may not be unique.*

Figure 3.3 illustrates Theorem 3.1 for symmetric S{*A*}. There is a filled entry in position (8, 6) of *L* because there is a fill-path 8 ⇐⇒<sub>min</sub> 6 in G(*A*) given by the sequence of (undirected) edges 8 ←→ 2 ←→ 5 ←→ 1 ←→ 6.

Corollary 3.2 characterizes edges of G<sup>*k*</sup> in terms of reachable sets in the original graph G(*A*).

**Figure 3.4** An example to illustrate reachable sets in G(*A*). The grey vertices 1, 2, and 3 are eliminated in the first three elimination steps (V<sub>4</sub> = {1, 2, 3}).

#### **Corollary 3.2 (Rose et al., 1976; George & Liu, 1980b)**

*Assume* S{*A*} *is symmetric. Let* V<sub>*k*</sub> *be the set of k* − 1 *vertices of* G(*A*) *that have already been eliminated, and let v be a vertex in the elimination graph* G<sup>*k*</sup>. *Then the set of vertices adjacent to v in* G<sup>*k*</sup> *is the set* Reach(*v*, V<sub>*k*</sub>) *of vertices reachable from v through* V<sub>*k*</sub> *in* G(*A*).

*Proof* The proof is by induction on *k*. The result holds trivially for *k* = 1 because Reach(*v*, V<sub>1</sub>) = *adj*<sub>G(*A*)</sub>{*v*}. Assume the result holds for G<sup>1</sup>, ..., G<sup>*k*</sup> with *k* ≥ 1, and let *v* be a vertex in the graph G<sup>*k*+1</sup> that is obtained after eliminating *v<sub>k</sub>* from G<sup>*k*</sup>. If *v* is not adjacent to *v<sub>k</sub>* in G<sup>*k*</sup>, then Reach(*v*, V<sub>*k*+1</sub>) = Reach(*v*, V<sub>*k*</sub>). Otherwise, if *v* is adjacent to *v<sub>k</sub>* in G<sup>*k*</sup>, then *adj*<sub>G<sup>*k*+1</sup></sub>{*v*} = Reach(*v*, V<sub>*k*</sub>) ∪ Reach(*v<sub>k</sub>*, V<sub>*k*</sub>). In both cases, Parter's rule implies that the new adjacency set is exactly equal to the vertices that are reachable from *v* through V<sub>*k*+1</sub>, that is, Reach(*v*, V<sub>*k*+1</sub>).

Figure 3.4 depicts a graph G(*A*). The adjacency sets of the vertices in G<sup>4</sup> that result from eliminating vertices V<sub>4</sub> = {1, 2, 3} are *adj*<sub>G<sup>4</sup></sub>{4} = Reach(4, V<sub>4</sub>) = {5}, *adj*<sub>G<sup>4</sup></sub>{5} = Reach(5, V<sub>4</sub>) = {4, 6, 7}, *adj*<sub>G<sup>4</sup></sub>{6} = Reach(6, V<sub>4</sub>) = {5, 7}, *adj*<sub>G<sup>4</sup></sub>{7} = Reach(7, V<sub>4</sub>) = {5, 6, 8}, and *adj*<sub>G<sup>4</sup></sub>{8} = Reach(8, V<sub>4</sub>) = {7}.
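Corollary 3.2 can be checked on a small example by comparing reachable sets in the original graph with the adjacency sets obtained from Parter's rule. The sketch below is plain Python; the six-vertex graph is a hypothetical example chosen here (it is not the graph of Figure 3.4), and the helper names `reach` and `elimination_graph` are likewise assumptions.

```python
def reach(adj, v, eliminated):
    """Vertices reachable from v along paths whose interior lies in `eliminated`."""
    reached, seen, stack = set(), {v}, [v]
    while stack:
        x = stack.pop()
        for w in adj[x]:
            if w in seen:
                continue
            seen.add(w)
            if w in eliminated:
                stack.append(w)      # the path may continue through w
            else:
                reached.add(w)       # an uneliminated vertex ends the path
    return reached


def elimination_graph(adj, order):
    """Adjacency sets after eliminating the vertices in `order` (Parter's rule)."""
    g = {x: set(nb) for x, nb in adj.items()}
    for k in order:
        nbrs = g.pop(k)
        for i in nbrs:
            g[i].discard(k)
            g[i] |= nbrs - {i}       # neighbours of k become pairwise adjacent
    return g


edges = [(1, 4), (1, 5), (2, 5), (2, 6), (3, 6), (4, 5)]
adj = {v: set() for v in range(1, 7)}
for i, j in edges:
    adj[i].add(j)
    adj[j].add(i)

eliminated = {1, 2, 3}
g4 = elimination_graph(adj, [1, 2, 3])
reach_sets = {v: reach(adj, v, eliminated) for v in (4, 5, 6)}
```

In this example the edge between 5 and 6 is fill created by eliminating vertex 2, and it is found by `reach` without forming any elimination graph.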

We remark that neither the local characterization of filled entries using Parter's rule nor Theorem 3.1 provides a direct answer as to whether a certain edge belongs to G(*L* + *L<sup>T</sup>*) (or G(*L* + *U*)); without performing the eliminations, they do not tell us whether a given entry of a factor of *A* is nonzero. Such questions are addressed by deeper theoretical and algorithmic results that are presented in subsequent chapters.

#### **3.3 Triangular Solves**

Once an LU factorization has been computed, the solution *x* of the linear system *Ax* = *b* is computed by solving the lower triangular system

$$Ly = b,\tag{3.3}$$

followed by the upper triangular system

$$Ux = y.\tag{3.4}$$

Solving a system with a triangular matrix and dense right-hand side vector is straightforward. The solution of (3.3) can be computed using **forward substitution**, in which the component *y*<sub>1</sub> is determined from the first equation and substituted into the second equation to obtain *y*<sub>2</sub>, and so on. Once *y* is available, the solution of (3.4) can be obtained by **back substitution**, in which the last equation is used to obtain *x<sub>n</sub>*, which is then substituted into equation *n* − 1 to obtain *x*<sub>*n*−1</sub>, and so on. Algorithm 3.3 is a simple lower triangular solve for dense *b*. If *L* is unit lower triangular, step 3 is not needed.

#### **ALGORITHM 3.3 Forward substitution: lower triangular solve** *Ly* = *b* **with** *b* **dense**

**Input:** Lower triangular matrix *L* with nonzero diagonal entries and dense righthand side *b*.

**Output:** The dense solution vector *y*.

1: Initialise *y* = *b*

2: **for** *j* = 1 : *n* **do**

3: *y<sub>j</sub>* = *y<sub>j</sub>* / *l<sub>jj</sub>*

4: **for** *i* = *j* + 1 : *n* **do**

5: **if** *l<sub>ij</sub>* ≠ 0 **then**

6: *y<sub>i</sub>* = *y<sub>i</sub>* − *l<sub>ij</sub> y<sub>j</sub>*

7: **end if**

8: **end for**

9: **end for**

When *b* is sparse, the solution *y* is also sparse. In particular, if in Algorithm 3.3 *y<sub>k</sub>* = 0, then the outer loop with *j* = *k* can be skipped. Furthermore, if *b*<sub>1</sub> = *b*<sub>2</sub> = ... = *b<sub>k</sub>* = 0 and *b*<sub>*k*+1</sub> ≠ 0, then *y*<sub>1</sub> = *y*<sub>2</sub> = ... = *y<sub>k</sub>* = 0. Scanning *y* to check for zeros adds *O*(*n*) to the complexity. But if the set of indices J = {*j* | *y<sub>j</sub>* ≠ 0} is known beforehand, then Algorithm 3.3 can be replaced by Algorithm 3.4. A possible way to determine J is discussed later (Theorem 5.2).
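A forward substitution that visits only the indices in J might look as follows. This is a minimal sketch in plain Python under assumed data structures: *L* is held by columns, with `diag[j]` storing *l<sub>jj</sub>* and `cols[j]` listing the pairs (*i*, *l<sub>ij</sub>*) with *i* > *j*, and the sparse vectors are dictionaries. The index set `J` is supplied by the caller; how to compute it is not addressed here.

```python
def sparse_forward_solve(diag, cols, b, J):
    """Solve L y = b for sparse b, visiting only the columns j in J.

    diag[j] = l_jj, cols[j] = [(i, l_ij), ...] with i > j (stored nonzeros only),
    b = {index: value}, J = indices j for which y_j is nonzero.
    """
    y = dict(b)
    for j in sorted(J):                      # indices in increasing order
        y[j] = y.get(j, 0.0) / diag[j]
        for i, lij in cols[j]:               # only stored entries of column j
            y[i] = y.get(i, 0.0) - lij * y[j]
    return y


# 4 x 4 example with 1-based indices; only b_2 and b_4 are nonzero.
diag = {1: 2.0, 2: 1.0, 3: 4.0, 4: 5.0}
cols = {1: [(3, 1.0)], 2: [(4, 3.0)], 3: [], 4: []}
b = {2: 2.0, 4: 1.0}
y = sparse_forward_solve(diag, cols, b, J={2, 4})
```

Because the entries of *L* are stored by columns, the test on *l<sub>ij</sub>* in the inner loop disappears: only stored nonzeros are visited.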

Note that the combined effect of forward substitution (3.3) followed by back substitution (3.4) often results in the final solution vector *x* being dense. This is the case if *y<sub>n</sub>* ≠ 0 and *U* has an off-diagonal entry in each row *i* (1 ≤ *i* < *n*).

#### **ALGORITHM 3.4 Forward substitution: lower triangular solve** *Ly* = *b* **with** *b* **sparse**

**Input:** Lower triangular matrix *L* with nonzero diagonal entries, sparse vector *b* and the set J of indices *j* such that *y<sub>j</sub>* ≠ 0.

**Output:** The sparse solution vector *y*.

1: Initialise *y* = *b*

2: **for** *j* ∈ J **do** ▷ Take indices from J in increasing order

3: *y<sub>j</sub>* = *y<sub>j</sub>* / *l<sub>jj</sub>*

4: **for** *i* = *j* + 1 : *n* **do**

5: **if** *l<sub>ij</sub>* ≠ 0 **then**

6: *y<sub>i</sub>* = *y<sub>i</sub>* − *l<sub>ij</sub> y<sub>j</sub>*

7: **end if**

8: **end for**

9: **end for**

#### **3.4 Reducibility and Block Triangular Forms**

The performance of algorithms for computing factorizations of sparse matrices can frequently be significantly enhanced by first permuting *A* to have a block form or by partitioning *A* into blocks. Permuting to block form is closely connected to matrix reducibility. *A* is said to be **reducible** if there is a permutation matrix *P* such that

$$PAP^T = \begin{pmatrix} A\_{p1,p1} & A\_{p1,p2} \\ 0 & A\_{p2,p2} \end{pmatrix},$$

where *A*<sub>*p*1,*p*1</sub> and *A*<sub>*p*2,*p*2</sub> are nontrivial square matrices (that is, they are of order at least 1). If *A* is not reducible, it is **irreducible**. If *A* is structurally symmetric, then *A*<sub>*p*1,*p*2</sub> = 0 and *PAP<sup>T</sup>* is block diagonal. The following example illustrates that a one-sided permutation can transform an irreducible matrix *A* into a reducible matrix *AQ*.

$$A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & \\ 1 & & \end{pmatrix}, \quad Q = \begin{pmatrix} & & 1 \\ & 1 & \\ 1 & & \end{pmatrix}, \quad AQ = \begin{pmatrix} 1 & 1 & 1 \\ & 1 & 1 \\ & & 1 \end{pmatrix}.$$

A matrix *A* is said to be a **Hall matrix** (or has the **Hall property**) if every set of *k* columns has nonzeros in at least *k* rows (1 ≤ *k* ≤ *n*). *A* is a **strong Hall matrix** (or has the **strong Hall property**) if every set of *k* columns (1 ≤ *k<n*) has nonzeros in at least *k* + 1 rows. The strong Hall property trivially implies the Hall property. The Hall property applies to rectangular *m* × *n* matrices with *m* ≥ *n*. If *A* is square, then *A* has the strong Hall property if and only if the directed graph G*(A)* is strongly connected.

The following theorem is an important consequence of reducibility.

#### **Theorem 3.3 (Brualdi & Ryser 1991)**

*Given a nonsingular nonsymmetric matrix A, there exists a permutation matrix P such that*

$$PAP^T = \begin{pmatrix} A\_{1,1} & A\_{1,2} & \cdots & A\_{1,nb} \\ 0 & A\_{2,2} & \cdots & A\_{2,nb} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A\_{nb,nb} \end{pmatrix},\tag{3.5}$$

*where the square matrices A<sub>ib,ib</sub> on the diagonal are irreducible. The set* {*A<sub>ib,ib</sub>* | 1 ≤ *ib* ≤ *nb*} *is uniquely determined (but the blocks may appear on the diagonal in a different order). The order of the rows and columns within each A<sub>ib,ib</sub> may not be unique.*

**Figure 3.5** The sparsity patterns of *A* (left) and the upper block triangular form *PAP<sup>T</sup>* with two blocks *A<sub>ib,ib</sub>*, *ib* = 1, 2, of orders 2 and 4 (right).

The **upper block triangular form** (3.5) is also known as the **Frobenius normal form**. It is said to be nontrivial if *nb >* 1, and this is the case if *A* does not have the strong Hall property. An example of a matrix that can be symmetrically permuted to block triangular form with *nb* = 2 is given in Figure 3.5.

In practice, many of the blocks in (3.5) are either sparse or zero blocks. Assuming the blocks *A<sub>ib,ib</sub>* on the diagonal are all nonsingular, an LU factorization of each can be computed independently. These can then be used to solve the permuted system *PAP<sup>T</sup> y* = *c* as a sequence of *nb* smaller problems, as outlined in Algorithm 3.5. The solution of the original system *Ax* = *b* follows by setting *c* = *Pb* and *x* = *P<sup>T</sup> y*. Because the algorithms used to transform *A* into a block triangular form are typically graph-based (and do not use the numerical values of the entries of *A*), pivoting needs to be incorporated within the factorization of the diagonal blocks. Algorithm 3.5 employs partial pivoting for this.
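The block back substitution can be sketched as follows in plain Python. This is an illustration under stated assumptions rather than a transcription of Algorithm 3.5: the blocks are small dense lists of rows held in a dictionary `blocks[(ib, jb)]` (a missing key denotes a zero block), the diagonal blocks are assumed nonsingular, and each diagonal block is solved by a hypothetical helper `solve_dense` that performs Gaussian elimination with partial pivoting instead of reusing a stored LU factorization.

```python
def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (a nonsingular)."""
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]     # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        m[k], m[p] = m[p], m[k]
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= f * m[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def solve_block_upper_triangular(blocks, c):
    """Solve the block upper triangular system for y, last block first."""
    nb = len(c)
    y = [None] * nb
    for ib in reversed(range(nb)):
        rhs = c[ib][:]
        for jb in range(ib + 1, nb):
            blk = blocks.get((ib, jb))
            if blk is None:                          # zero block: nothing to subtract
                continue
            for r, row in enumerate(blk):
                rhs[r] -= sum(v * yj for v, yj in zip(row, y[jb]))
        y[ib] = solve_dense(blocks[(ib, ib)], rhs)
    return y


blocks = {
    (0, 0): [[0.0, 2.0], [1.0, 1.0]],    # needs a row interchange
    (0, 1): [[1.0], [0.0]],
    (1, 1): [[4.0]],
}
y = solve_block_upper_triangular(blocks, [[5.0, 3.0], [8.0]])
```

The leading diagonal block in the example has a zero in its (1, 1) position, so the solve succeeds only because pivoting is applied within the block.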

The transversal of a matrix *A* is the set of its nonzero diagonal elements. *A* has a **full** or **maximum transversal** if all its diagonal entries are nonzero. There exist permutation matrices *P* and *Q* such that *PAQ* has a full transversal if and only if *A* has the Hall property. Moreover, if *A* is nonsingular, then it can be nonsymmetrically permuted to have a full transversal. However, the converse is clearly not true (for example, a matrix with all its entries equal to one has a full transversal, but it is singular). Permuting *A* to have a full transversal will be discussed in Section 6.3.

#### **ALGORITHM 3.5 Solve a sparse linear system in upper block triangular form**

**Input:** Upper block triangular matrix (3.5) and a conformally partitioned right-hand side vector *c*.

**Output:** The conformally partitioned solution vector *y*.

If *A* has a full transversal, then there exists a permutation matrix *P<sub>s</sub>* such that *P<sub>s</sub>AP<sub>s</sub><sup>T</sup>* has the form (3.5). In other words, once *A* has a full transversal, a symmetric permutation is sufficient to obtain the form (3.5). Finding *P<sub>s</sub>* is identical to finding the strongly connected components (SCCs) of the digraph G(*A*) = (V, E) (Section 2.3). To find the SCCs, V is partitioned into non-empty subsets V<sub>*i*</sub> with each vertex belonging to exactly one subset. Each vertex *i* in the **quotient graph** corresponds to a subset V<sub>*i*</sub>, and there is an edge in the quotient graph with endpoints *i* and *j* if E contains at least one edge with one endpoint in V<sub>*i*</sub> and the other in V<sub>*j*</sub>. The **condensation** (or component graph) of a digraph is a quotient graph in which the SCCs form the subsets of the partition, that is, each SCC is contracted to a single vertex. This reduction provides a simplified view of the connectivity between components. An example is given in Figure 3.6. It has five SCCs: {*p*, *q*, *r*}, {*s*, *t*, *u*}, {*v*}, {*w*}, and {*x*}.

The following result gives the relationship between SCCs and DAGs.

#### **Theorem 3.4 (Sharir 1981; Cormen et al. 2009)**

*The condensation* G*<sup>C</sup> of a digraph is a DAG (directed acyclic graph).*

Because any DAG can be topologically ordered, G*<sup>C</sup>* = (V<sub>*C*</sub>, E<sub>*C*</sub>) can be topologically ordered, and if V<sub>*i*</sub> and V<sub>*j*</sub> are contracted to *s<sub>i</sub>* and *s<sub>j</sub>* and (*s<sub>i</sub>* → *s<sub>j</sub>*) ∈ E<sub>*C*</sub>, then *s<sub>i</sub>* < *s<sub>j</sub>*. It follows that to permute *A* to block triangular form it is sufficient to find the SCCs of G(*A*). That is, a topological ordering of the vertices of the condensation G*<sup>C</sup>*, the quotient graph induced by the SCCs, gives the block triangular form. There are many ways to find SCCs, one of which is Tarjan's algorithm (Algorithm 3.6). The key idea here is that vertices of an SCC form a subtree in the DFS spanning tree of the graph. The algorithm performs depth-first searches, keeping track of two properties for each vertex *v*: when *v* was first encountered (held in *invorder*(*v*)) and the lowest numbered vertex that is reachable from *v* (called the low-link value and held in *lowlink*(*v*)). It pushes vertices onto a stack as it goes and outputs an SCC when it finds a vertex for which *invorder*(*v*) and *lowlink*(*v*) are the same. The value *lowlink*(*v*) is computed during the DFS from *v*, as this finds the vertices that are reachable from *v*.

**Figure 3.6** An illustration of the strong components of a digraph. On the left, the five SCCs are denoted using different colours and on the right is the condensation DAG G*<sup>C</sup>* formed by the SCCs.

In Algorithm 3.6, the variable *index* is the DFS vertex number counter that is incremented when an unvisited vertex is visited. *S* is the vertex stack. It is initially empty and is used to store the history of visited vertices that are not yet committed to an SCC. Vertices are added to the stack in the order in which they are visited. The outermost loop of the algorithm visits each vertex that has not yet been visited, ensuring vertices that are not reachable from the starting vertex are eventually visited. The recursive function **scomp\_step** performs a single DFS, finding all descendants of vertex *v*, and reporting all SCCs for that subgraph. When a vertex *v* finishes recursing, if *lowlink(v)* = *invorder(v)*, then it is the root vertex of an SCC comprising all of the vertices above it on the stack. The algorithm pops the stack up to and including *v*; these popped vertices form an SCC. The algorithm is linear in the number of edges and vertices, that is, it is of complexity *O(*|V|+|E|*)*.
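The description above maps onto a short recursive routine. The Python sketch below keeps the names *invorder*, *lowlink* and **scomp\_step** from the text; the dictionary-based graph format and the example digraph are assumptions made here (the example has the same five SCCs as Figure 3.6, but its edges are hypothetical). Recursion limits it to modest graph sizes.

```python
def tarjan_scc(graph):
    """Strongly connected components of a digraph given as {vertex: [successors]}."""
    index = 0
    invorder, lowlink = {}, {}
    stack, on_stack = [], set()
    sccs = []

    def scomp_step(v):
        nonlocal index
        index += 1
        invorder[v] = lowlink[v] = index
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in invorder:            # w not yet visited: recurse on it
                scomp_step(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:              # w belongs to an SCC still being built
                lowlink[v] = min(lowlink[v], invorder[w])
        if lowlink[v] == invorder[v]:        # v is the root of an SCC
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in graph:
        if v not in invorder:
            scomp_step(v)
    return sccs


graph = {
    "p": ["q"], "q": ["r"], "r": ["p", "s"],
    "s": ["t"], "t": ["u"], "u": ["s", "v"],
    "v": ["w"], "w": ["x"], "x": [],
}
sccs = tarjan_scc(graph)
components = [sorted(c) for c in sccs]
```

The components are produced in reverse topological order of the condensation, so reversing the output list gives an ordering of the blocks with every edge of the condensation pointing forwards, as required for the upper block triangular form.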

#### **3.5 Block Partitioning**

In this section, we assume that S{*A*} is symmetric and G = *(*V*,* E*)* is the adjacency graph of *A*.

#### **ALGORITHM 3.6 Tarjan's algorithm to find the strongly connected components (SCCs) of a digraph**

**Input:** Digraph G = (V, E).

**Output:** Strongly connected components of G, determined one-by-one.

```
 1: Vv = ∅, S = (), index = 0            ▷ Each vertex is initially unvisited
 2: for each v ∈ V do
 3:   if v ∉ Vv then
 4:     scomp_step(v)
 5:   end if
 6: end for
 7: recursive function scomp_step(v)
 8:   Vv = Vv ∪ {v}                       ▷ Add v to the set of visited vertices
 9:   index = index + 1                   ▷ Set the index for v to smallest unused index
10:   invorder(v) = index, lowlink(v) = index
11:   push(S, v)                          ▷ Put v on the stack
12:   Set v = head(S)                     ▷ v is the current head of S
13:   for each (v → w) ∈ E do             ▷ Look in the adjacency list of v
14:     if w ∉ Vv then                    ▷ w has not yet been visited; recurse on it
15:       scomp_step(w)
16:       lowlink(v) = min(lowlink(v), lowlink(w))
17:     else if w ∈ S then                ▷ w is in the stack and hence in current SCC
18:       lowlink(v) = min(lowlink(v), invorder(w))
19:     end if
20:   end for
21:   if lowlink(v) = invorder(v) then
22:     pop all vertices down to v from S to obtain a new SCC
23:   end if
24: end recursive function
```
#### *3.5.1 Block Structure Based on Supervariables*

Sets of columns of *A* frequently have identical sparsity patterns. For instance, when *A* arises from a finite element discretization, the columns corresponding to variables that belong to the same set of finite elements have the same pattern, and this occurs as a result of each node of the finite element mesh having multiple degrees of freedom associated with it. This repetition of the sparsity patterns can be used to substantially enhance performance.

Adjacent vertices *u* and *v* in an undirected graph G = *(*V*,* E*)* are said to be **indistinguishable** if they have the same neighbours, that is, *adj*G{*u*}∪{*u*} = *adj*G{*v*}∪{*v*}. A set of mutually indistinguishable vertices is called an **indistinguishable vertex set**. If U ⊆ V is an indistinguishable vertex set, then U is **maximal** if U ∪ {*w*} is not indistinguishable for any *w* ∈ V \ U.

Indistinguishability is an equivalence relation on V, and maximal indistinguishable vertex sets represent its classes. This implies a partitioning of V into *nsup* ≥ 1 non-empty disjoint subsets

$$\mathcal{V} = \mathcal{V}\_1 \cup \mathcal{V}\_2 \cup \dots \cup \mathcal{V}\_{nsup}. \tag{3.6}$$

An indistinguishable vertex set can be represented by a single vertex, called a **supervariable**.

If the vertices belonging to each subset V<sub>1</sub>, ..., V<sub>*nsup*</sub> are numbered consecutively, with those in V<sub>*i*</sub> preceding those in V<sub>*i*+1</sub> (1 ≤ *i* < *nsup*), and if *P* is the permutation matrix corresponding to this ordering, then the permuted matrix *PAP<sup>T</sup>* has a block structure in which the blocks are dense (with the possible exception of the diagonal entries, which can be zero); the dimensions of the blocks are equal to the sizes of the indistinguishable sets.

One approach for identifying supervariables is outlined in Algorithm 3.7. Initially, all the vertices are placed in a single vertex set (that is, into a single supervariable). This is split into two supervariables by taking the first vertex *j* = 1 and moving vertices in the adjacency set of *j* into a new vertex set (a new supervariable). Each vertex *j* is considered in turn, and each vertex set V<sub>*sv*</sub> that contains a vertex in *adj*<sub>G</sub>{*j*} ∪ {*j*} is split into two by moving the vertices in *adj*<sub>G</sub>{*j*} ∪ {*j*} that belong to V<sub>*sv*</sub> into a new vertex set. Note that as a result of the splitting and moving of vertices, a vertex set can become empty, in which case it is discarded. Once the supervariables have been determined, the permuted matrix *PAP<sup>T</sup>* can be condensed to a matrix of order equal to *nsup*; the corresponding graph is called the **supervariable graph**. If the average number of variables in each supervariable is *k*, using the supervariable graph will reduce the amount of integer data that is read during the symbolic phase by a factor of about *k*<sup>2</sup>.

As an illustration, consider the following 5 × 5 matrix

$$
\begin{array}{cc}
 & \begin{matrix} 1 & 2 & 3 & 4 & 5 \end{matrix} \\
\begin{matrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{matrix} &
\begin{pmatrix} \* & \* & & & \* \\ \* & \* & & & \* \\ & & \* & \* & \* \\ & & \* & \* & \* \\ \* & \* & \* & \* & \* \end{pmatrix}
\end{array}
$$

Initially, 1, 2, 3, 4, 5 are put into a single vertex set V<sub>1</sub>. Consider *j* = 1. Vertices *i* = 1, 2 and 5 belong to *adj*<sub>G</sub>{1} ∪ {1}; they are moved from V<sub>1</sub> into a new vertex set. There is no further splitting of the vertex sets for *j* = 2. For *j* = 3, *adj*<sub>G</sub>{3} ∪ {3} = {3, 4, 5}. Vertices *i* = 3 and 4 are moved from V<sub>1</sub> into a new vertex set. V<sub>1</sub> is now empty and can be discarded. Vertex *i* = 5 is moved from the vertex set that holds vertices 1 and 2 into a new vertex set. For *j* = 4 and 5, no additional splitting is performed. Thus, three supervariables are found, namely {1, 2}, {3, 4}, and {5}.
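The splitting procedure just traced can be reproduced with a few lines of Python. The sketch below follows the refinement idea of Algorithm 3.7 but is only an illustration: the function name `find_supervariables`, the dictionary of adjacency sets and the integer set identifiers are assumptions made here.

```python
def find_supervariables(adj):
    """Partition the vertices of an undirected graph into indistinguishable sets.

    adj maps each vertex to the set of its neighbours (without the vertex itself).
    """
    vertices = sorted(adj)
    sets = {0: set(vertices)}            # all vertices start in one vertex set
    sv_of = {v: 0 for v in vertices}     # sv_of[i] = id of the set containing i
    next_id = 1
    for j in vertices:
        new_for = {}                     # old set id -> new set id created for this j
        for i in adj[j] | {j}:
            sv = sv_of[i]
            if sv not in new_for:        # first occurrence of sv for the current j
                new_for[sv] = next_id
                sets[next_id] = set()
                next_id += 1
            nsv = new_for[sv]
            sets[sv].remove(i)           # move i from V_sv to V_nsv
            sets[nsv].add(i)
            sv_of[i] = nsv
            if not sets[sv]:             # discard V_sv if it has become empty
                del sets[sv]
    return sorted(sorted(s) for s in sets.values())


# Adjacency sets of the 5 x 5 example above.
adj = {1: {2, 5}, 2: {1, 5}, 3: {4, 5}, 4: {3, 5}, 5: {1, 2, 3, 4}}
supervariables = find_supervariables(adj)
```

Each vertex is moved once for every entry in its row, so the work in this sketch is proportional to the number of entries of *A*.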

#### **ALGORITHM 3.7 Find the supervariables of an undirected graph**

**Input:** Graph G of a symmetrically structured matrix.

**Output:** Partitioning of V into indistinguishable vertex sets.

```
 1: V1 = {1, 2, ..., n}
 2: for j = 1 : n do
 3:   for i ∈ adjG{j} ∪ {j} do
 4:     Find sv such that i ∈ Vsv
 5:     if this is the first occurrence of sv for the current index j then
 6:       Establish a new vertex set Vnsv and move i from Vsv to Vnsv
 7:     else
 8:       Move i from Vsv to Vnsv
 9:     end if
10:     Discard Vsv if it is empty
11:   end for
12: end for
```
#### *3.5.2 Block Structure Using Symbolic Dot Products*

An alternative way to find a block structure uses symbolic dot products between the rows of the matrix. While fully dense blocks can be found this way, it can also be used to determine an approximate block structure in which blocks are classified as dense or sparse based on a chosen threshold; this can be useful in preconditioning iterative methods. Although we assume that S{*A*} is symmetric, modifications can extend the approach to general nonsymmetric *A*.

Rewrite *A* as row vectors

$$A = \begin{pmatrix} a\_1^T, \dots, a\_n^T \end{pmatrix}^T, \text{where } a\_i^T = A\_{i, 1:n},$$

and consider G(*A*) = (V, E). A partition V = V<sub>1</sub> ∪ ... ∪ V<sub>*nb*</sub> is constructed using row products *a<sub>i</sub><sup>T</sup> a<sub>k</sub>* between different rows of *A*. These express the level of orthogonality between the rows; if *a<sub>i</sub><sup>T</sup> a<sub>k</sub>* is small, then *i* and *k* are assigned to different vertex sets. Algorithm 3.8 treats all entries of *A* as unity, and the symbolic row products can be considered as a generalization of the angles between rows expressed by their cosines, hence the notation *cosine* for the vector that stores these products. The vertex sets are described using the vector *adjmap*. On output, if *adjmap*(*i*<sub>1</sub>) = *adjmap*(*i*<sub>2</sub>), then vertices *i*<sub>1</sub> and *i*<sub>2</sub> belong to the same vertex set. Symmetry of S{*A*} simplifies the computation of the symbolic row products because for row *i* only *k* > *i* is considered, that is, only the symbolic row products that correspond to one triangle of *A<sup>T</sup>A* are checked.

The procedure outlined in Algorithm 3.8 and illustrated in Figure 3.7 is controlled by a threshold parameter *τ* ∈ (0, 1]. Row *k* is added to the subset to which row *i* belongs if the cosine of the angle between them is at least *τ*. If *τ* < 1, the block structure depends on the order in which the rows are processed, while *τ* = 1 gives the exact indistinguishable vertex sets because, in this case, the row patterns being compared must be identical for the rows to be assigned to the same set.

#### **ALGORITHM 3.8 Find approximately indistinguishable vertex sets in an undirected graph**

**Input:** Graph G = (V, E) of a symmetrically structured matrix *A*, the number *nz<sub>i</sub>* of entries in row *i* of *A* (1 ≤ *i* ≤ *n*), and a threshold parameter *τ* ∈ (0, 1].

**Output:** Partitioning of V into *nb* disjoint approximately indistinguishable vertex sets.

```
 1: nb = 0, adjmap(1 : n) = 0, cosine(1 : n) = 0
 2: for i = 1 : n do
 3:   if adjmap(i) = 0 then
 4:     nb = nb + 1                        ▷ Start a new set
 5:     adjmap(i) = nb
 6:     for (i, j) ∈ E do                  ▷ Corresponds to an entry in A_{i,1:n}
 7:       for (k, j) ∈ E with k > i do     ▷ Both rows i and k have an entry in column j
 8:         if adjmap(k) = 0 then          ▷ k has not yet been added to a vertex set
 9:           cosine(k) = cosine(k) + 1    ▷ Increase partial dot product
10:         end if
11:       end for
12:     end for
13:     for k with cosine(k) ≠ 0 do
14:       if cosine(k)^2 ≥ τ^2 ∗ nz_i ∗ nz_k then   ▷ Test similarity of row patterns
15:         adjmap(k) = nb
16:       end if
17:       cosine(k) = 0
18:     end for
19:   end if
20: end for
```
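To see the influence of the threshold, the symbolic dot product test can be tried on the 5 × 5 pattern of Section 3.5.1. The Python sketch below is an illustration under assumptions made here: rows are given as sets of column indices (diagonal included), the column lists are built explicitly, and the function name `approx_blocks` is hypothetical. The symbolic dot products for row *i* are accumulated over all of its columns before the similarity test is applied.

```python
def approx_blocks(rows, tau):
    """Group rows with similar sparsity patterns; returns {row: set number}.

    rows maps each row index to the set of column indices of its entries.
    """
    cols = {}                                    # column j -> rows with an entry in j
    for i, pattern in rows.items():
        for j in pattern:
            cols.setdefault(j, []).append(i)
    adjmap = {i: 0 for i in rows}                # 0 means "not yet assigned"
    nb = 0
    for i in sorted(rows):
        if adjmap[i] != 0:
            continue
        nb += 1                                  # start a new set with row i
        adjmap[i] = nb
        cosine = {}                              # symbolic dot products with row i
        for j in rows[i]:
            for k in cols[j]:
                if k > i and adjmap[k] == 0:
                    cosine[k] = cosine.get(k, 0) + 1
        for k, dot in cosine.items():
            # cos^2 >= tau^2, written without square roots or divisions
            if dot * dot >= tau * tau * len(rows[i]) * len(rows[k]):
                adjmap[k] = nb
    return adjmap


rows = {1: {1, 2, 5}, 2: {1, 2, 5}, 3: {3, 4, 5}, 4: {3, 4, 5}, 5: {1, 2, 3, 4, 5}}
exact = approx_blocks(rows, tau=1.0)
approximate = approx_blocks(rows, tau=0.5)
```

With *τ* = 1 the three supervariables {1, 2}, {3, 4} and {5} are recovered, while *τ* = 0.5 merges row 5 into the first set because it shares three of its five entries with row 1.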

**Figure 3.7** An example to illustrate Algorithm 3.8. The original matrix is given (left) together with the permuted matrix with indistinguishable vertex sets V = {1, 3} ∪ {2, 6} ∪ {4} ∪ {5} obtained using *τ* = 1 (centre) and the permuted matrix with approximately indistinguishable vertex sets V = {1, 3, 5} ∪ {2, 6} ∪ {4} obtained using *τ* = 0.5 (right). The threshold *τ* = 0.5 results in putting row 5 into the same set as row 1, making the vertex sets only approximately indistinguishable. The permuted matrix on the right has an approximate block form.

### **3.6 Notes and References**

A standard description of LU factorizations based on the generic scheme given in Algorithm 3.1 can be found in the classical book by Ortega (1988b); this includes the symmetric case and discusses early parallelization issues (which are also considered in the review of Dongarra et al. (1984)). A more algorithmically oriented approach is given in Golub & Van Loan (1996). For the column variant with partial pivoting, we recommend the detailed description of the sparse case in Gilbert & Peierls (1988). Many results for sparse LU factorizations are surveyed by Gilbert & Ng (1993) and Gilbert (1994). Pothen & Toledo (2004) consider both symmetric and nonsymmetric matrices in their survey of graph models of sparse elimination. The review by Davis et al. (2016) provides many further references.

Parter (1961) presents Parter's rule, and its nonsymmetric version is given in Haskins & Rose (1973). Building on the paper of Rose et al. (1976), Rose & Tarjan (1978) were the first to methodically consider the symbolic structure of Gaussian elimination for nonsymmetric matrices. Related work is included in the seminal paper on Cholesky factorizations by Liu (1986). Fill-in rules in the general context of Schur complements in LU factorizations can be found in Eisenstat & Liu (1993b).

Classical and detailed treatments of triangular solves that also cover sparse issues are given in the papers Brayton et al. (1970), Gilbert & Peierls (1988), and Gilbert (1994). For reducibility theory that is closely connected to the general theory of matrices, see Brualdi & Ryser (1991), which includes, for example, a proof of Theorem 3.3.

Algorithm 3.6 for computing strongly connected components of a digraph is introduced in Tarjan (1972); see also Sharir (1981) and Duff & Reid (1978) for an early implementation.

For identifying supervariables, Algorithm 3.7 follows Reid & Scott (1999), but see also Ashcraft (1995) and Hogg & Scott (2013a) (the latter presents an efficient variant that employs a stack). The approximate block partitioning of Section 3.5.2 is from the paper by Saad (2003a), which also describes some modifications of the basic approach; more sophisticated schemes with overlapping blocks are given in Fritzsche et al. (2013).

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 4 Sparse Cholesky Solver: The Symbolic Phase**

*The modern view of numerical linear algebra as being to a large extent the study and systematic use of matrix decompositions has certainly been influenced by Cholesky's posthumously published work – Benzi (2017).*

This chapter focuses on the symbolic phase of a sparse Cholesky solver. The sparsity pattern S{*A*} of the symmetric positive definite (SPD) matrix *A* is used to determine the nonzero structure of the Cholesky factor *L* without computing the numerical values of the nonzeros. The subsequent numerical factorization is discussed in the next chapter. Because the symbolic phase works only with S{*A*} (the values of the entries of *A* are not considered), it is also used for symmetric indefinite matrices and sometimes within LU factorizations of symmetrically structured nonsymmetric problems. It is implicitly assumed that all the diagonal entries of *A* are included in S{*A*} (even if they are zero). During the factorization phase, it may be necessary to amend the data structures to allow for indefiniteness. This makes the factorization of indefinite matrices potentially more expensive and more complex; this is considered further in Chapter 7.

A fundamental difference between dense and sparse Cholesky factorizations is that, in the latter, each column of *L* depends on only a subset of the previous columns. The elimination tree is a data structure that describes the dependencies among the columns of *A* during its factorization. A key result that assists in the understanding of sparse Cholesky factorizations is that the sparsity pattern of column *j* of *L* is the union of the pattern of column *j* of the lower triangular part of *A* and the patterns of the children of *j* in the elimination tree; this is shown in Section 4.3. Furthermore, the fact that disjoint parts of the elimination tree can be factored independently offers the potential for high-level tree-based parallelism that does not exist for dense matrices.

#### **4.1 Column Replication Principle**

We begin by looking at how the sparsity pattern of a computed column of *L* influences the patterns of subsequent Schur complements. From (3.2), the Schur complement *S*<sup>(*k*)</sup> can be written as

$$S^{(k)} = A\_{k:n,k:n} - \sum\_{j=1}^{k-1} \begin{pmatrix} l\_{kj} \\ \vdots \\ l\_{nj} \end{pmatrix} (l\_{kj} \dots l\_{nj}) \,. \tag{4.1}$$

Consider column *j* of *L* (1 ≤ *j* ≤ *k* − 1), and let *l<sub>ij</sub>* ≠ 0 for some *i* > *j*. The involvement of *l<sub>ij</sub>* in the outer product in (4.1) allows the following observation.

**Observation 4.1** *For any i > j* ≥ 1 *such that l<sub>ij</sub>* ≠ 0

$$\mathcal{S}\{L\_{i:n,j}\} \subseteq \mathcal{S}\{L\_{i:n,i}\}.\tag{4.2}$$

*This is called the* **column replication principle** *because the pattern of column j of L (rows i to n) is replicated in the pattern of column i of L.*

Denote the row index of the first subdiagonal nonzero entry in column *j* of *L* by *parent (j )*, that is,

$$parent(j) = \min\{i \mid i > j \text{ and } l\_{ij} \neq 0\}.\tag{4.3}$$

If there is no such entry, set *parent*(*j*) = 0. The row index *parent*(*parent*(*j*)) is denoted by *parent*<sup>2</sup>(*j*), and so on. Applying column replication recursively implies the sparsity pattern of column *j* of *L* is replicated in that of column *parent*(*j*), which in turn is replicated in the pattern of column *parent*<sup>2</sup>(*j*), and so on. This is illustrated in Figure 4.1. Here *j* = 1, and because the first subdiagonal entry in column 1 is in row 3, *parent*(1) = 3. Likewise, *parent*(3) = *parent*<sup>2</sup>(1) = 5.

**Figure 4.1** An illustration of column replication. On the left are the entries in *L* before step 1 of a Cholesky factorization (that is, the entries in the lower triangular part of *A*); in the centre, we show the replication of the nonzeros from column 1 in the pattern of column *parent*(1) = 3 (red entries *f*); on the right, we show the subsequent replication in column *parent*<sup>2</sup>(1) = 5.

The following result shows that, provided *A* is irreducible, the mapping *parent (j )* has nonzero values given by (4.3) for all *j<n*.

**Theorem 4.1 (Liu 1986)** *If A is SPD and irreducible, then in each column j (*1 ≤ *j* < *n) of its Cholesky factor L there exists an entry l<sub>ij</sub>* ≠ 0 *with i > j.*

*Proof* From Parter's rule, each step of the Cholesky factorization corresponds to adding new edges into the graph of the corresponding Schur complement. If *A* is irreducible, then the graphs corresponding to the Schur complements are connected. Consequently, for any vertex *j* (1 ≤ *j<n*) in any of these graphs, there is at least one vertex *i* with *i>j* to which *j* is connected. This corresponds to the nonzero entry in column *j* of *L*.

With the convention *parent*<sup>1</sup>(*j*) = *parent*(*j*), the next theorem shows that if entry *l<sub>ij</sub>* of *L* is nonzero, then *parent<sup>t</sup>*(*j*) = *i* for some *t* ≥ 1, and there is an entry in row *i* of *L* in each of the columns in the replication sequence *j*, *parent*<sup>1</sup>(*j*), *parent*<sup>2</sup>(*j*), ..., *parent<sup>t</sup>*(*j*).

**Theorem 4.2 (Liu 1990; George 1998)** *Let A be SPD, and let L be its Cholesky factor. If l<sub>ij</sub>* ≠ 0 *for some j < i* ≤ *n, then there exists t* ≥ 1 *such that parent<sup>t</sup>*(*j*) = *i and l<sub>ik</sub>* ≠ 0 *for k* = *j*, *parent*<sup>1</sup>(*j*), *parent*<sup>2</sup>(*j*), ..., *parent<sup>t</sup>*(*j*).

*Proof* If *i* = *parent*<sup>1</sup>(*j*), the result is immediate. Otherwise, there exists an index *k*, *j* < *k* < *i*, of a subdiagonal entry in column *j* of *L* such that *k* = *parent*<sup>1</sup>(*j*). Column replication implies *l<sub>ik</sub>* ≠ 0. Applying an inductive argument to *l<sub>ik</sub>*, the result follows after a finite number of steps.

If there is a sequence of nonzeros in a row of *L*, it is natural to ask where the sequence begins. It is straightforward to see that if there is no *k* ≥ 1 such that *a<sub>ik</sub>* ≠ 0, no replication of nonzeros can start in row *i*. The main result on the replication of nonzeros of *A* is summarized as Theorem 4.3.

**Theorem 4.3 (Liu 1986)** *Let A be SPD, and let L be its Cholesky factor. If a<sub>ij</sub>* = 0 *for some* 1 ≤ *j < i* ≤ *n, then there is a filled entry l<sub>ij</sub>* ≠ 0 *if and only if there exist k < j and t* ≥ 1 *such that a<sub>ik</sub>* ≠ 0 *and parent<sup>t</sup>*(*k*) = *j.*

#### **4.2 Elimination Trees**

The discussion of column replication is significantly simplified using elimination trees. The **elimination tree** (or **etree**) T *(A)* (or simply T ) of an SPD matrix has vertices 1*,* 2*,...,n* and an edge between each pair *(j, parent (j ))*, where *parent (j )* is given by (4.3); *j* is a root vertex of the tree if *parent (j )* = 0. The edges of T are considered to be directed from a child to its parent, that is,

$$\mathcal{E}(\mathcal{T}) = \{ (j \longrightarrow i) \mid i = parent(j) \}.$$

**Figure 4.2** An illustration of a sparse matrix *A* with a symmetric sparsity pattern and its elimination tree T(*A*). The root vertex is 8. The filled entries in S{*L* + *L<sup>T</sup>*} are denoted by *f*.

If T has a single component, then the root vertex is *n*. Despite the terminology, the elimination tree need not be connected and in general is a **forest**. For simplicity, in our discussions, we assume T has a single component, and we say that T is described by the vector *parent*.

An example of a matrix and its elimination tree is given in Figure 4.2. Here and elsewhere, following conventional notation, directional arrows are omitted from the tree plot.

Concepts such as child, leaf, ancestor, and descendant vertices introduced in Section 2.3 for directed rooted trees can be applied to T. Additionally, *anc*<sub>T</sub>{*j*} and *desc*<sub>T</sub>{*j*} are defined to be the sets of ancestors and descendants of vertex *j* in T. We denote by T(*j*) the **subtree** of T induced by *j* and *desc*<sub>T</sub>{*j*}; *j* is the root vertex of T(*j*). The **size** |T(*j*)| is the number of vertices in the subtree. A **pruned subtree** of T(*j*) is the connected subgraph induced by *j* and a subset of *desc*<sub>T</sub>{*j*}. That is, for any vertex *i* in a pruned subtree of T(*j*), all the ancestors of *i* in T(*j*) belong to the pruned subtree. A pruned subtree of T shares the mapping *parent* with T.

The following observation is straightforward.

**Observation 4.2** *If i* ∈ *anc*<sub>T</sub>{*j*} *for some j* ≠ *i, then i > j.*

The connection between the mapping *parent* and the sets of ancestors and descendants is emphasized by the next observation.

**Observation 4.3** *If i and j are vertices of the elimination tree* T *with j < i* ≤ *n, then*

*i* ∈ *anc*<sub>T</sub>{*j*} *if and only if j* ∈ *desc*<sub>T</sub>{*i*} *if and only if parent<sup>t</sup>*(*j*) = *i for some t* ≥ 1*.*

The results in Section 4.1 can be expressed using rooted trees. Consider, for example, Theorem 4.2. Instead of stating that there exists *t* ≥ 1 such that *parent<sup>t</sup>*(*j*) = *i*, we can write that *i* ∈ *anc*<sub>T</sub>{*j*}. Rewriting Theorem 4.3 as the following corollary provides a clear characterization of the sparsity patterns of the rows of *L*.

**Figure 4.3** The row subtree T<sub>r</sub>(5) of the elimination tree T from Figure 4.2 (left). Vertex 3 has been pruned because *a*<sub>35</sub> = 0. The row subtree T<sub>r</sub>(8) (right) differs from T = T(*A*) because vertex 1 has been pruned (*a*<sub>18</sub> = 0).

**Corollary 4.4 (Liu 1986)** *Consider the elimination tree* T *and the Cholesky factor L of A. If i and j are vertices of* T *with j < i* ≤ *n and a<sub>ij</sub>* = 0*, then l<sub>ij</sub>* ≠ 0 *if and only if there exists k < j such that j* ∈ *anc*<sub>T</sub>{*k*} *and a<sub>ik</sub>* ≠ 0*.*

The subtree of T with vertices that correspond to the nonzeros of row *i* of *L* is called the *i*-th **row subtree** and is denoted by T*r(i)*. Formally, it is a pruned subtree of T induced by the union of the vertex set

$$\{i\} \cup \{k \mid a\_{ik} \neq 0 \text{ and } k < i\}$$

with all vertices on the directed paths in T from *k* to *i*, that is, with all their ancestors from T*r(i)*. The root vertex is *i*, and the leaf vertices are a subset of the column indices in the *i*-th row of the lower triangular part of *A*. Figure 4.3 illustrates row subtrees for the matrix and elimination tree from Figure 4.2. Note that row subtrees are connected subgraphs of T , even if T is not connected. If T can be found without determining the pattern of *L*, then T*r(i)* can be used to derive the sparsity pattern of row *i* of *L*, without having to store each entry explicitly.

Theorem 4.5 characterizes the ancestors of a given vertex *j* using paths in G*(A)*. The proof helps clarify the relationship between T and paths in G*(A)*.

**Theorem 4.5 (Schreiber 1982; Liu 1986)** *If i and j are vertices in the elimination tree* T *with j < i* ≤ *n, then i* ∈ *anc*<sub>T</sub>{*j*} *if and only if there exists a path*

$$j \; \overset{\mathcal{G}(A)}{\underset{\{1,\ldots,i\}}{\Longleftrightarrow}} \; i. \tag{4.4}$$

*Proof* Assume *i* ∈ *anc*<sub>T</sub>{*j*}. Then there is a directed path from *j* to *i* in T of length *l* ≥ 1. Each edge of this path belongs to G(*L*) and corresponds either to an edge in G(*A*) or to a fill-path in G(*A*). Connecting these paths together gives (4.4).

Conversely, if the path (4.4) exists, induction on its length can be used to prove the result. If the path is of length 1, then the result holds because *i* and *j* are connected in G(*A*) by an edge. Consequently, from Theorem 4.2, *i* is an ancestor of *j*. Now assume that the result is true for all paths of length less than *l* (*l* > 1), and consider a path of length *l*. Let *m* be the largest vertex on this path. If *m* < *j*, then (4.4) is a fill-path connecting *i* and *j* and, therefore, *i* ∈ *anc*<sub>T</sub>{*j*}. Otherwise, for *m* ≥ *j*, the assumption implies *i* ∈ *anc*<sub>T</sub>{*m*} ∪ {*m*} and *m* ∈ *anc*<sub>T</sub>{*j*} ∪ {*j*}, that is, *i* ∈ *anc*<sub>T</sub>{*j*}.

Given a vertex *j* in T , the following corollary indicates how to find *parent (j )* (if it exists). If the set of ancestors of *j* is non-empty, then the lowest numbered one is its parent.

**Corollary 4.6 (Liu 1986, 1990)** *Vertex i is the parent of vertex j in* T *if and only if i is the lowest numbered vertex satisfying j<i* ≤ *n for which there is a path (4.4).*

The existence of (4.4) is equivalent to requiring that *i* and *j* belong to the same component of the graph G(*A*<sub>1:*i*,1:*i*</sub>) corresponding to the *i* × *i* principal leading submatrix *A*<sub>1:*i*,1:*i*</sub> of *A*. Figure 4.4 depicts G(*A*) for the matrix *A* given in Figure 4.2. Consider vertex 4. Its set of ancestors for which paths from Theorem 4.5 exist comprises vertices 5, 6, and 8. Vertex 7 is not an ancestor of 4 because there is no path from 7 to 4 in the graph G(*A*<sub>1:7,1:7</sub>). Among the ancestors of 4, vertex 5 fulfils the condition from Corollary 4.6 and is thus the parent of 4.

T = T(*A*) can be constructed by stepwise extensions of the elimination trees of the principal leading submatrices of *A*. Assume we have T(*A*<sub>1:*i*−1,1:*i*−1</sub>) and we want to construct T(*A*<sub>1:*i*,1:*i*</sub>). Initialize T(*A*<sub>1:*i*,1:*i*</sub>) = T(*A*<sub>1:*i*−1,1:*i*−1</sub>). If there are no entries in row *i* of *A* to the left of the diagonal, then there is nothing to do, and only an isolated vertex *i* is added. Otherwise, *i* is the root of the row subtree T<sub>r</sub>(*i*) and an ancestor of some vertex *j* in T. The ancestors *k* of *j* with *k* < *i* are in T(*A*<sub>1:*i*−1,1:*i*−1</sub>). Because row subtrees are connected subgraphs of T, a directed path in T(*A*<sub>1:*i*,1:*i*</sub>) with *parent<sup>t</sup>*(*j*) = *i* exists for some *t* ≥ 1. The search for this path starts from *jroot* = *j* and continues, while *parent*(*jroot*) ≠ 0 and *parent*(*jroot*) ≠ *i*, using a sequence of assignments *jroot* = *parent*(*jroot*). It terminates either when a root is reached (*parent*(*jroot*) = 0), in which case *parent*(*jroot*) is set to *i*, or when *parent*(*jroot*) = *i*, which means that *i* has already been added when tracing the path from another entry *j* in row *i*. The construction of T is summarized in Algorithm 4.1.

**Figure 4.4** The graph G(*A*) of the matrix from Figure 4.2 illustrating Theorem 4.5 and Corollary 4.6.
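The stepwise construction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a transcription of Algorithm 4.1: indices are 0-based, a root is marked by `-1` instead of 0, and the input format `lower` (for each row, the column indices of the entries to the left of the diagonal) and the function name `etree_basic` are choices made here for the example.

```python
def etree_basic(lower):
    """Elimination tree by stepwise extension, without path compression.

    lower[i] lists the column indices j < i with a_ij != 0 (0-based).
    Returns the vector parent; parent[j] == -1 marks a root vertex.
    """
    n = len(lower)
    parent = [-1] * n
    for i in range(n):                      # extend the tree by vertex i
        for j in lower[i]:
            jroot = j
            # Climb until a root is reached or i is already on the path.
            while parent[jroot] != -1 and parent[jroot] != i:
                jroot = parent[jroot]
            if parent[jroot] == -1:         # jroot was a root: attach it to i
                parent[jroot] = i
    return parent


# Hypothetical 5 x 5 pattern: rows 1 and 2 have an entry in column 0,
# row 4 has entries in columns 2 and 3.
lower = [[], [0], [0], [], [2, 3]]
parent = etree_basic(lower)
print(parent)
```

In this small example, the entries in rows 1 and 2 of column 0 create a filled entry in position (2, 1), which is why vertex 1 acquires vertex 2 as its parent even though the corresponding entry of the input pattern is zero.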


The most expensive part of Algorithm 4.1 is the **while** loop that searches for subtree roots. Because the directed path from *j* to its root *parent<sup>t</sup> (j )* is unique, shortcuts can be incorporated; this is called **path compression**. Having found a directed path from *j* to *k*, subsequent searches can be made more efficient by introducing a vector *ancestor* and setting *ancestor(j )* = *k*. The modified algorithm is outlined in Algorithm 4.2. It maintains two structures using the current values of *parent* and *ancestor*. The tree described by *ancestor* is termed the **virtual tree**.

Figure 4.5 shows a matrix for which path compression makes constructing T significantly more efficient. For this example, T is determined by the mapping *parent*(6) = 0; *parent*(*i*) = *i* + 1 for *i* = 1, ..., 5. The complexity of Algorithm 4.1 is *O*(*n*<sup>2</sup>), but for this example the complexity of Algorithm 4.2 is *O*(*n*). Formally, the complexity of Algorithm 4.2 is *O*(*nz*(*A*) log<sub>2</sub>(*n*)), where *nz*(*A*) is the number of nonzeros of *A*, but the logarithmic factor is rarely reached. Additional modifications can reduce the theoretical complexity to *O*(*nz*(*A*) *g*(*nz*(*A*), *n*)), where *g*(*nz*(*A*), *n*) is a very slowly increasing function called the functional inverse of Ackermann's function. This means that, in practice, the complexity of constructing T, and hence of obtaining an implicit representation of S{*L*}, is close to linear in *nz*(*A*) (which in general is much smaller than *nz*(*L*)).

**ALGORITHM 4.2 Construction of an elimination tree using path compression**

**Input:** *A* with a symmetric sparsity pattern and its undirected graph G.

**Output:** Elimination tree T described by the vector *parent*.

1: **for** *i* = 1 : *n* **do** ▷ Loop over the rows of *A*
2: &nbsp;&nbsp;*parent*(*i*) = 0, *ancestor*(*i*) = 0 ▷ Initialisation
3: &nbsp;&nbsp;**for** *j* ∈ *adj*<sub>G</sub>{*i*} and *j* < *i* **do** ▷ Loop over the below diagonal entries in row *i*
4: &nbsp;&nbsp;&nbsp;&nbsp;*jroot* = *j*
5: &nbsp;&nbsp;&nbsp;&nbsp;**while** *ancestor*(*jroot*) ≠ 0 and *ancestor*(*jroot*) ≠ *i* **do**
6: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*l* = *ancestor*(*jroot*)
7: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*ancestor*(*jroot*) = *i* ▷ Path compression to accelerate future searches
8: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*jroot* = *l*
9: &nbsp;&nbsp;&nbsp;&nbsp;**end while**
10: &nbsp;&nbsp;&nbsp;&nbsp;**if** *ancestor*(*jroot*) = 0 **then**
11: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*ancestor*(*jroot*) = *i* and *parent*(*jroot*) = *i*
12: &nbsp;&nbsp;&nbsp;&nbsp;**end if**
13: &nbsp;&nbsp;**end for**
14: **end for**

```
⎛ ∗ ∗ ∗ ∗ ∗ ∗ ⎞
⎜ ∗ ∗         ⎟
⎜ ∗   ∗       ⎟
⎜ ∗     ∗     ⎟
⎜ ∗       ∗   ⎟
⎝ ∗         ∗ ⎠
```
**Figure 4.5** A sparse matrix for which computing the elimination tree using Algorithm 4.2 is much more efficient than using Algorithm 4.1.
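To make the effect of path compression on the arrow-shaped pattern of Figure 4.5 concrete, the following Python sketch mirrors Algorithm 4.2 and counts the iterations of the **while** loop. The 0-based indexing, the use of `-1` in place of 0 for "no parent", the input format `lower`, and the counter `steps` are assumptions introduced for this illustration only.

```python
def etree_compressed(lower):
    """Elimination tree with path compression (in the spirit of Algorithm 4.2).

    lower[i] lists the column indices j < i with a_ij != 0 (0-based).
    Returns (parent, steps), where steps counts while-loop iterations.
    """
    n = len(lower)
    parent = [-1] * n
    ancestor = [-1] * n          # the virtual tree used for shortcuts
    steps = 0
    for i in range(n):
        for j in lower[i]:
            jroot = j
            while ancestor[jroot] != -1 and ancestor[jroot] != i:
                nxt = ancestor[jroot]
                ancestor[jroot] = i      # compress: point directly at i
                jroot = nxt
                steps += 1
            if ancestor[jroot] == -1:
                ancestor[jroot] = i
                parent[jroot] = i
    return parent, steps


# Arrow pattern as in Figure 4.5: every row i >= 1 has one entry in column 0.
n = 6
arrow = [[]] + [[0] for _ in range(1, n)]
parent, steps = etree_compressed(arrow)
print(parent, steps)
```

For this pattern each row from the third onwards needs a single step, because the shortcut stored in `ancestor[0]` leads straight to the current root; without compression, the search from column 0 would have to walk the whole chain built so far.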

The following simple theorem states that there is no edge in G(*L* + *L<sup>T</sup>*) between vertices belonging to subtrees of T with disjoint vertex sets. If there were such an edge (*s*, *t*), then from Theorem 4.2, one of the vertices *s* and *t* must be an ancestor of the other, which is a contradiction. The importance of this result is that it implies that for any such pairs of vertices the corresponding column sparsity patterns in *L* can be computed in parallel.

**Theorem 4.7 (Liu 1990)** *Consider the elimination tree* T *and the Cholesky factor L of A. Let* T *(i) and* T *(j ) be two vertex-disjoint subtrees of* T *. Then for all s* ∈ T *(i) and t* ∈ T *(j ), the entry lst of L is zero.*

#### **4.3 Sparsity Pattern of** *L*

The explicit structure of *L* is not always required; sometimes only the numbers of nonzeros in each row and column of *L* are needed. Examples include comparing the amount of fill-in in the factors for different initial orderings of *A*, allocating factor storage, finding relaxed supernodes (see Section 4.6), and determining load balance and synchronization events in parallel factorizations.

Let *rowL*{*i*} denote the sparsity pattern of the off-diagonal part of row *i* of *L*, that is,

$$row\_L\{i\} = \mathcal{S}\{L\_{i,1:i-1}\} = \{j \mid j < i, \ l\_{ij} \neq 0\}, \quad 1 \le i \le n.$$

The number of entries in *L* is

$$nz(L) = \sum\_{i=1}^{n} \left| row\_L\{i\} \right| + n.$$

Corollary 4.4 implies *rowL*{*i*} is given by the vertices of the row subtree T*r(i)*. This suggests Algorithm 4.3. Here the vector *mark* is used to flag vertices so as to avoid including them more than once within a row subtree. The complexity of the algorithm is *O(nz(L))*.

#### **ALGORITHM 4.3 Computation of the row sparsity patterns of the Cholesky factor** *L*

**Input:** *A* with a symmetric sparsity pattern, its undirected graph G and elimination tree T described by the vector *parent*.

**Output:** Row sparsity patterns *row<sub>L</sub>*{*i*} of the Cholesky factor *L* of *A* (1 ≤ *i* ≤ *n*).

1: **for** *i* = 1 : *n* **do** ▷ Loop over the rows of *A*
2: &nbsp;&nbsp;*row<sub>L</sub>*{*i*} = ∅ ▷ Initialisation
3: &nbsp;&nbsp;*mark*(*i*) = *i*
4: &nbsp;&nbsp;**for** *k* ∈ *adj*<sub>G</sub>{*i*} and *k* < *i* **do** ▷ Loop over the below diagonal entries in row *i*
5: &nbsp;&nbsp;&nbsp;&nbsp;*j* = *k*
6: &nbsp;&nbsp;&nbsp;&nbsp;**while** *mark*(*j*) ≠ *i* **do** ▷ Column *j* not yet encountered in row *i*
7: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*mark*(*j*) = *i* ▷ Flag *j* as encountered in row *i*
8: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*row<sub>L</sub>*{*i*} = *row<sub>L</sub>*{*i*} ∪ {*j*} ▷ Add *j* to the sparsity pattern of row *i*
9: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*j* = *parent*(*j*) ▷ Move up the elimination tree
10: &nbsp;&nbsp;&nbsp;&nbsp;**end while**
11: &nbsp;&nbsp;**end for**
12: **end for**
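The traversal of row subtrees in Algorithm 4.3 can be sketched in Python as below. The sketch assumes 0-based indices, a `parent` vector in which `-1` marks the root, and a hypothetical input `lower` holding, for each row, the column indices of the entries to the left of the diagonal; the elimination tree is supplied by hand for a small made-up pattern.

```python
def row_patterns(lower, parent):
    """Row sparsity patterns of L obtained by walking the row subtrees.

    lower[i] lists the column indices k < i with a_ik != 0 (0-based);
    parent describes the elimination tree (-1 marks a root).
    """
    n = len(lower)
    mark = [-1] * n
    rows = []
    for i in range(n):
        mark[i] = i                  # the root of the row subtree stops the walk
        pattern = []
        for k in lower[i]:
            j = k
            while mark[j] != i:      # column j not yet seen in row i
                mark[j] = i
                pattern.append(j)
                j = parent[j]        # move up the elimination tree
        rows.append(sorted(pattern))
    return rows


# Hypothetical 5 x 5 example with its elimination tree given explicitly.
lower = [[], [0], [0], [], [2, 3]]
parent = [1, 2, 4, 4, -1]
rows = row_patterns(lower, parent)
nz_L = sum(len(r) for r in rows) + len(rows)   # off-diagonal entries + diagonal
print(rows, nz_L)
```

Row 2 of the example picks up column 1 as a filled entry, because vertex 1 lies on the tree path from vertex 0 to vertex 2.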

**Figure 4.6** An illustration of the sparsity pattern of *A* and its graph G*(A)* (left) and the sparsity pattern of the corresponding skeleton matrix *A*<sup>−</sup> and graph G*(A*−*)* (right). The entries in *A* and edges of G*(A)* that do not belong to the skeleton matrix and graph are depicted in red.

Efficiency can be improved by employing the **skeleton graph** G*(A*−*)* that is obtained from G*(A)* by removing every edge *(i, j )* for which *j<i* and *j* is not a leaf vertex of T*r(i)*. G*(A*−*)* is the smallest subgraph of G*(A)* with the same filled graph as G*(A)*. The corresponding matrix is the **skeleton matrix**. An example is given in Figure 4.6. The complexity of constructing the elimination tree using the skeleton matrix and its graph G*(A*−*)* is *O(nz(A*−*) g(nz(A*−*), n))*, where *nz(A*−*)* is the number of entries in the skeleton matrix. Because *nz(A*−*)* is often significantly smaller than *nz(A)*, an implementation that processes G*(A*−*)* rather than G*(A)* can be substantially faster.

Analogously to the row sparsity patterns, let *colL*{*j* } denote the sparsity pattern of the off-diagonal part of column *j* of *L*, that is,

$$col\_L\{j\} = \mathcal{S}\{L\_{j+1:n,j}\} = \{i \mid i > j, \ l\_{ij} \neq 0\}, \quad 1 \le j \le n.$$

The column replication principle can be written as

$$col\_L\{j\} \subseteq col\_L\{parent(j)\} \cup \{parent(j)\}.$$

Theorem 4.8 describes *colL*{*j* } using the vertices of the subtree T *(j )*.

**Theorem 4.8 (George & Liu 1980c, 1981)** *The column sparsity pattern col<sub>L</sub>*{*j*} *of the Cholesky factor L of the matrix A is equal to the adjacency set of vertices of the subtree* T(*j*) *in* G(*A*)*, that is,*

$$col\_L\{j\} = adj\_{\mathcal{G}(A)}\{\mathcal{T}(j)\}.\tag{4.5}$$

**Figure 4.7** Two topological orderings of an elimination tree.

*Proof* If *i* ∈ *col<sub>L</sub>*{*j*}, then *j* ∈ *row<sub>L</sub>*{*i*}, and Theorem 4.3 implies *j* ∈ *anc*<sub>T</sub>{*k*} for some *k* such that *a<sub>ik</sub>* ≠ 0. That is, *i* ∈ *adj*<sub>G</sub>{T(*j*)}. Conversely, *i* ∈ *adj*<sub>G</sub>{T(*j*)} implies that in row *i* the entry in column *j* of *L* is nonzero. Thus, *j* ∈ *row<sub>L</sub>*{*i*}, and hence, *i* ∈ *col<sub>L</sub>*{*j*}.

Algorithm 4.3 can be used to compute the column counts and the column sparsity patterns because when *j* is added to *row<sub>L</sub>*{*i*} at line 8, *i* can be added to *col<sub>L</sub>*{*j*}. This does not generally obtain the column sparsity patterns sequentially. To derive an approach that does compute them sequentially, rewrite (4.5) as follows:

$$col\_L\{j\} = \left( adj\_{\mathcal{G}(A)}\{j\} \cup \bigcup\_{k \in \mathcal{T}(j) \setminus \{j\}} col\_L\{k\} \right) \setminus \{1, \ldots, j\}.$$

Using the column replication, this can be significantly simplified:

$$col\_L\{j\} = \left( adj\_{\mathcal{G}(A)}\{j\} \cup \bigcup\_{\{k \mid j = parent(k)\}} col\_L\{k\} \right) \setminus \{1, \ldots, j\}. \tag{4.6}$$

This is used to obtain Algorithm 4.4, which constructs the sparsity pattern of each column *j* of *L* as the union of the sparsity pattern of column *j* of *A* (*adj*G*(A)*{*j* }) and the patterns of the children of *j* in T *(A)*. Here *child*{*j* } denotes the set of children of *j* . Because any child *k* of *j* satisfies *k<j* , the *j* -th outer step has the information needed to compute the sparsity pattern described by (4.6). Observe that T *(A)* does not need to be input.

#### **ALGORITHM 4.4 Determining the sparsity patterns of each column of** *L*

**Input:** *A* with symmetric sparsity pattern and its undirected graph G.

**Output:** Column sparsity patterns *col<sub>L</sub>*{*j*} of the Cholesky factor *L* of *A* (1 ≤ *j* ≤ *n*).

```
1: child{1 : n} = ∅  ▷ Initialisation
2: for j = 1 : n do  ▷ Loop over the columns of L
3:   colL{j} = adjG{j} \ {1, ..., j − 1}
4:   for k ∈ child{j} do  ▷ Unifying child structures in (4.6)
5:     colL{j} = colL{j} ∪ colL{k} \ {j}
6:   end for
7:   if colL{j} ≠ ∅ then
8:     l = min{i | i ∈ colL{j}}
9:     child{l} = child{l} ∪ {j}  ▷ Parent of j detected using Corollary 4.6
10:   end if
11: end for
```
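A compact Python sketch of the column-by-column computation in (4.6) is given below. It is illustrative only: indices are 0-based, the graph is passed as a list of neighbour sets `adj` (an assumed input format), children are collected as the parents are discovered, and the function additionally returns the `parent` vector (with `-1` for a root) as a by-product.

```python
def column_patterns(adj):
    """Column sparsity patterns of L from the graph of A, following (4.6).

    adj[j] is the set of neighbours of vertex j in G(A) (0-based,
    symmetric, no self-loops). Returns (col, parent).
    """
    n = len(adj)
    child = [[] for _ in range(n)]       # children found so far
    col = [set() for _ in range(n)]
    parent = [-1] * n
    for j in range(n):
        col[j] = {i for i in adj[j] if i > j}   # column j of A below the diagonal
        for k in child[j]:
            col[j] |= col[k] - {j}              # merge the child patterns
        if col[j]:
            p = min(col[j])                     # first subdiagonal entry
            parent[j] = p
            child[p].append(j)
    return col, parent


# Hypothetical 5 x 5 pattern with edges (1,0), (2,0), (4,2), (4,3).
adj = [{1, 2}, {0}, {0, 4}, {4}, {2, 3}]
col, parent = column_patterns(adj)
print(col, parent)
```

Because every child of *j* has a smaller index than *j*, all child patterns are complete by the time column *j* is processed, which is why no elimination tree has to be supplied in advance.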
#### **4.4 Topological Orderings**

The outer loop in Algorithm 4.4 does not have to be performed in the strict order *j* = 1, ..., *n*. What is necessary is that for each step *j*, the column sparsity pattern for each child of *j* has already been computed. An ordering of the vertices in a tree (and, more generally, in a DAG) is a topological ordering if, for all *i* and *j*, *j* ∈ *desc*<sub>T</sub>{*i*} implies *j* < *i* (Section 2.2). Observation 4.2 confirms that the ordering of vertices in the elimination tree T is a topological ordering. A new topological ordering of T defines a relabelling of its vertices corresponding to a symmetric permutation of *A*. This is illustrated in Figure 4.7. The sparsity patterns of the Cholesky factors of *A* and *PAP<sup>T</sup>* can be different, but the following result shows that the amount of fill-in is the same.

**Theorem 4.9 (Liu 1990)** *Let* S{*A*} *be symmetric. If P is the permutation matrix corresponding to a topological ordering of the elimination tree* T *of A, then the filled graphs of A and PAP<sup>T</sup> are isomorphic.*

There are many topological orderings of T. One class is obtained using the depth-first search given by Algorithm 2.1. This searches all the components of T starting at their root vertices. In this case, once vertex *i* has been visited, all the vertices of the subtree T(*i*) are visited immediately after *i* and *i* is labelled as the last vertex of T(*i*). A topological ordering of T is a **postordering** if the vertex set of any subtree T(*i*) (*i* = 1, ..., *n*) is a contiguous sublist of 1, ..., *n*. Unless additional rules on how vertices are selected are imposed, a postordering is generally not unique, as demonstrated in Figure 4.8. One possible postordering is defined in Algorithm 2.1. In this case, there is some freedom in the depth-first search to choose from the vertices that have not been visited, resulting in different postorderings.
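As an illustration of how a postordering might be computed from the vector *parent*, the Python sketch below performs an iterative depth-first search from each root and labels a vertex once all of its descendants have been labelled. It is not a transcription of Algorithm 2.1; the 0-based indexing, the `-1` root marker, the child-selection rule (children in increasing index order), and the function names are assumptions made for this example.

```python
def postorder(parent):
    """Return order, where order[new_label] = old_label, for a postordering.

    parent describes a forest (0-based, -1 marks a root).
    """
    n = len(parent)
    children = [[] for _ in range(n)]
    roots = []
    for j, p in enumerate(parent):
        (roots if p == -1 else children[p]).append(j)
    order = []
    for root in roots:
        stack = [(root, iter(children[root]))]
        while stack:
            v, it = stack[-1]
            c = next(it, None)
            if c is None:            # all descendants of v are labelled
                stack.pop()
                order.append(v)
            else:
                stack.append((c, iter(children[c])))
    return order


def relabel(parent, order):
    """Parent vector of the relabelled tree."""
    new = {old: lab for lab, old in enumerate(order)}
    out = [-1] * len(parent)
    for old, p in enumerate(parent):
        out[new[old]] = -1 if p == -1 else new[p]
    return out


# A topological ordering that is not a postordering: T(3) = {0, 3}.
parent = [3, 2, 4, 4, -1]
order = postorder(parent)
new_parent = relabel(parent, order)
print(order, new_parent)
```

After relabelling, each subtree occupies a contiguous range of labels; visiting the children in a different order would give another valid postordering, which reflects the non-uniqueness noted above.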

**Figure 4.8** An example to illustrate the non-uniqueness of postorderings of an elimination tree.

#### **4.5 Leaf Vertices of Row Subtrees**

Leaf vertices of row subtrees play a key role in graph algorithms related to sparse Cholesky factorizations. They can be used to find the skeleton matrix described in Section 4.3, and they are important in parallel processing based on fundamental supernodes (see Section 4.6.1). Theorem 4.10 describes the relation between standard subtrees of T and row subtrees obtained by pruning (Section 4.2). This pruning is determined by the leaf vertices of row subtrees.

**Theorem 4.10 (Liu 1986)** *Let the elimination tree* T *of A be postordered. Let the column indices of the nonzeros in the strictly lower triangular part of row i of A be c*<sub>1</sub>*, ..., c<sub>s</sub> with s* ≥ 1 *and* 0 < *c*<sub>1</sub> < ... < *c<sub>s</sub>* < *i. Then c<sub>t</sub> is a leaf vertex of the row subtree* T<sub>r</sub>(*i*) *if and only if*

$$t = 1 \quad or \ \ (1 < t \le s \quad \text{and} \ c\_{t-1} \notin \mathcal{T}(c\_t)).$$

*Proof* *c*<sub>1</sub> is always a leaf vertex of T<sub>r</sub>(*i*). If this is not the case, then there exists a directed path from some vertex *k*, *k* ≠ *c*<sub>1</sub>, to *i* via *c*<sub>1</sub> such that *k* ∈ T<sub>r</sub>(*i*) and *a<sub>ik</sub>* ≠ 0. Postordering of T implies *k* < *c*<sub>1</sub>. This is a contradiction because *c*<sub>1</sub> is the index of the first nonzero in row *i*.

Consider now *t* > 1. Assume that *c*<sub>*t*−1</sub> ∈ T(*c<sub>t</sub>*) and that *c<sub>t</sub>* is a leaf vertex of T<sub>r</sub>(*i*). Row replication (Theorem 4.2) implies any *k* ∈ *anc*<sub>T</sub>{*c*<sub>*t*−1</sub>} ∪ {*c*<sub>*t*−1</sub>} such that *c*<sub>*t*−1</sub> ≤ *k* < *i* satisfies *l<sub>ik</sub>* ≠ 0. Because T is postordered, *c*<sub>*t*−1</sub> ≤ *k* ≤ *c<sub>t</sub>*, and there is at least one *k* < *c<sub>t</sub>* satisfying this inequality. It follows that *k* = *c*<sub>*t*−1</sub>. Because *k* belongs to T<sub>r</sub>(*i*), *c<sub>t</sub>* cannot be a leaf vertex of T<sub>r</sub>(*i*), which is a contradiction.

Conversely, assume for *t* > 1 that *c*<sub>*t*−1</sub> ∉ T(*c<sub>t</sub>*) and *c<sub>t</sub>* is not a leaf vertex of T<sub>r</sub>(*i*). From the second part of the assumption and the fact that *c<sub>t</sub>* ∈ T<sub>r</sub>(*i*), it follows that there is at least one leaf vertex *k* < *i* of T<sub>r</sub>(*i*) from which there is a directed path to *i* via *c<sub>t</sub>*. Thus *k* < *c<sub>t</sub>*. From the definition of the postordering of T, all vertices *l* with *k* < *l* ≤ *c<sub>t</sub>* are vertices of T(*c<sub>t</sub>*). Vertex *c*<sub>*t*−1</sub> must be among them and *c*<sub>*t*−1</sub> ∈ T(*c<sub>t</sub>*). This contradiction completes the proof.

**ALGORITHM 4.5 Find the sizes of subtrees** T(*i*) **of** T

**Input:** Elimination tree T described by the vector *parent*.

**Output:** Subtree sizes |T(*i*)| (1 ≤ *i* ≤ *n*).

1: |T(1 : *n*)| = 1
2: **for** *i* = 1 : *n* − 1 **do**
3: &nbsp;&nbsp;*k* = *parent*(*i*)
4: &nbsp;&nbsp;|T(*k*)| = |T(*k*)| + |T(*i*)|
5: **end for**

**Corollary 4.11 (Liu 1986)** *Under the assumptions of Theorem 4.10, ct is a leaf vertex of* T*r(i) if and only if*

$$t = 1 \quad \text{or} \quad (1 < t \le s \quad \text{and} \quad c\_{t-1} < c\_t - |\mathcal{T}(c\_t)| + 1).$$

Subtree sizes can be computed using Algorithm 4.5. Correctness of Algorithm 4.5 is guaranteed because *parent* defines a topological ordering of T .
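A short Python sketch of the subtree-size computation follows. It assumes 0-based indices and a `-1` root marker (so forests are handled by skipping roots), and it relies on the property stated above that every child has a smaller index than its parent; the example tree is hypothetical.

```python
def subtree_sizes(parent):
    """Sizes |T(i)| of all subtrees, given a topologically ordered tree.

    parent[i] > i for every non-root vertex; parent[i] == -1 marks a root.
    """
    size = [1] * len(parent)
    for i, k in enumerate(parent):
        if k != -1:
            size[k] += size[i]   # size[i] is final because all children of i precede i
    return size


# Hypothetical postordered tree: 0 -> 1 -> 4 and 2 -> 3 -> 4.
parent = [1, 4, 3, 4, -1]
size = subtree_sizes(parent)
first = [i - size[i] + 1 for i in range(len(parent))]  # smallest label in T(i)
print(size, first)
```

In a postordered tree the vertices of T(*i*) are exactly the labels from `first[i]` to `i`, which is the fact exploited by Corollary 4.11 to replace the membership test by a comparison of indices.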

Theorem 4.12 relaxes the condition that the entries in the rows of *A* are sorted by increasing column indices. This allows the leaf vertices of the row subtrees to be determined by columns.

**Theorem 4.12 (Liu et al. 1993)** *Consider the elimination tree* T *of A. Vertex j is a leaf vertex of some row subtree of* T *if and only if there exists i* ∈ *adj*<sub>G(*A*)</sub>{*j*}*, j < i* ≤ *n, such that i* ∉ *adj*<sub>G(*A*)</sub>{*k*} *for all k* ∈ T(*j*) \ {*j*}*.*

*Proof* Assume that for some *i* ∈ *anc*<sub>T</sub>{*j*} vertex *j* is a leaf vertex of T<sub>r</sub>(*i*). That is, *i* ∈ *adj*<sub>G(*A*)</sub>{*j*}, *i* > *j*. Suppose there exists *k* in T(*j*) \ {*j*} such that *i* ∈ *adj*<sub>G(*A*)</sub>{*k*}. Then all the ancestors of *k*, *k* ≤ *i*, in particular *j*, belong to T<sub>r</sub>(*i*) and *j* cannot be a leaf vertex of T<sub>r</sub>(*i*). This is a contradiction.

Conversely, assume that *j* is not a leaf vertex of any row subtree of T and that there exists *i* ∈ *adj*<sub>G(*A*)</sub>{*j*}, *j* < *i* ≤ *n*, such that *i* ∉ *adj*<sub>G(*A*)</sub>{*k*} for all *k* ∈ T(*j*) \ {*j*}. Because *j* is not a leaf vertex of any such T<sub>r</sub>(*i*), Theorem 4.3 implies that there exists *k* ∈ T(*j*) \ {*j*} such that *a<sub>ik</sub>* ≠ 0, which gives a contradiction and completes the proof.

To find leaf vertices of row subtrees of T , Algorithm 4.6 uses a marking scheme based on Theorem 4.12 and exploits the postordering of T . The auxiliary vector *prev*\_*nonz* stores the column indices of the most recently encountered entries in the rows of the strictly lower triangular part of *A*.
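The marking scheme can be sketched in Python as follows. The sketch assumes 0-based indices, so *prev_nonz* is initialised to `-1` rather than 0, and it takes the strictly lower triangular pattern by columns in a hypothetical list `lower_cols`; the postordered elimination tree of the small example is supplied by hand.

```python
def row_subtree_leaves(lower_cols, parent):
    """Flag vertices that are leaves of at least one row subtree.

    lower_cols[j] lists the row indices i > j with a_ij != 0 (0-based);
    parent describes a postordered elimination tree (-1 marks a root).
    """
    n = len(parent)
    size = [1] * n                       # subtree sizes |T(j)|
    for i, k in enumerate(parent):
        if k != -1:
            size[k] += size[i]
    isleaf = [False] * n
    prev_nonz = [-1] * n                 # last column seen in each row
    for j in range(n):
        for i in lower_cols[j]:
            k = prev_nonz[i]
            if k < j - size[j] + 1:      # previous entry lies outside T(j)
                isleaf[j] = True
            prev_nonz[i] = j
    return isleaf


# Hypothetical pattern: column 0 has entries in rows 1 and 2, column 1 in
# row 2, columns 2 and 3 in row 4.
lower_cols = [[1, 2], [2], [4], [4], []]
parent = [1, 2, 4, 4, -1]
isleaf = row_subtree_leaves(lower_cols, parent)
print(isleaf)
```

In this example vertex 1 is never flagged: its only entry is in row 2, where the previously seen column 0 already belongs to T(1), so the entry in position (2, 1) would not belong to the skeleton matrix.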

#### **ALGORITHM 4.6 Find leaf vertices of row subtrees of** T

**Input:** *A* with a symmetric sparsity pattern and a corresponding postordered elimination tree T.

**Output:** Logical vector *isleaf* with entries set to true for leaf vertices of row subtrees.

1: *isleaf*(1 : *n*) = *false*, *prev\_nonz*(1 : *n*) = 0
2: Compute |T(1 : *n*)| ▷ Use Algorithm 4.5
3: **for** *j* = 1 : *n* **do** ▷ Loop over the columns of *A*
4: &nbsp;&nbsp;**for** *i* such that *i* > *j* and *a<sub>ij</sub>* ≠ 0 **do** ▷ Row index in strictly lower triangular part of *A*
5: &nbsp;&nbsp;&nbsp;&nbsp;*k* = *prev\_nonz*(*i*) ▷ Column index of most recently seen entry in row *i*
6: &nbsp;&nbsp;&nbsp;&nbsp;**if** *k* < *j* − |T(*j*)| + 1 **then**
7: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*isleaf*(*j*) = *true* ▷ *j* is a leaf vertex by Corollary 4.11
8: &nbsp;&nbsp;&nbsp;&nbsp;**end if**
9: &nbsp;&nbsp;&nbsp;&nbsp;*prev\_nonz*(*i*) = *j* ▷ Flag *j* as the most recently seen entry in row *i*
10: &nbsp;&nbsp;**end for**
11: **end for**

#### **4.6 Supernodes and the Assembly Tree**

Because of column replication, the columns of *L* generally become denser as the Cholesky factorization proceeds. Exploiting this density can significantly enhance the performance of the numerical factorization in terms of both computation time and memory requirements. For this, we require the concept of supernodes. The idea is to group together columns with the same sparsity structure, so that they can be treated as a dense matrix for storage and computation. Let 1 ≤ *s*, *t* ≤ *n* with *s* + *t* − 1 ≤ *n*. A set of contiguously numbered columns of *L* with indices *S* = {*s*, *s* + 1, ..., *s* + *t* − 1} is a **supernode** of *L* if

$$col\_L\{s\} \cup \{s\} = col\_L\{s + t - 1\} \cup \{s, \dots, s + t - 1\},\tag{4.7}$$

and *S* cannot be extended for *s >* 1 by adding *s* − 1 or for *s* + *t* − 1 *< n* by adding *s* + *t*. Because *S* cannot be extended, it is a **maximal** subset of column indices. In graph terminology, a supernode is a **maximal clique** of contiguous vertices of G(*L* + *L<sup>T</sup>*). A supernode may contain a single vertex. Figure 4.9 illustrates the supernodes in a Cholesky factor of order 8.

The **supernodal elimination** or **assembly tree** is defined to be the reduction of the elimination tree that contains only supernodes. Each vertex of the elimination tree is associated with one elimination, and a single integer (the index of its parent) is needed. Associated with each vertex of the assembly tree is an index list of the row indices of the nonzeros in the columns of the supernode. These implicitly define the sparsity pattern of *L*. An example that demonstrates the difference between the elimination and assembly trees is given in Figure 4.10. Here the elimination tree is postordered, and there are 5 supernodes: {1*,* 2}, 3, 4, 5, {6*,* 7*,* 8*,* 9}. For supernode 1 that comprises columns 1 and 2, the row index list is {1*,* 2*,* 8*,* 9}.

**Figure 4.9** An example to illustrate supernodes in *L*. The first supernode comprises columns 1 and 2, the second columns 3 and 4, and the third columns 5–8.

**Figure 4.10** A sparse matrix and its postordered elimination tree (left) and postordered assembly tree (right). Filled entries in S{*L* + *L<sup>T</sup>*} are denoted by *f*. For the assembly tree, the vertices are in red and the index lists associated with each vertex are given.

Supernodes can be characterized by the following result on the column counts of *L*, from which we see that supernodes can be found using column counts rather than the column sparsity patterns that appear in (4.7).

**Theorem 4.13 (Liu et al. 1993)** *The set of columns of L with indices S* = {*s, s* + 1*,...,s* + *t* − 1} *is a supernode of L if and only if it is a maximal set of contiguous columns such that s* + *i* − 1 *is a child of s* + *i for i* = 1*,...,t* − 1 *and*

$$\left|col\_L\{s\}\right| = \left|col\_L\{s + t - 1\}\right| + t - 1. \tag{4.8}$$

*Proof* Let *S* be a supernode. For *i, j* ∈ *S* with *i>j* , we have *i* ∈ *colL*{*j* }. This implies that in the postordered elimination tree the vertex *i* = *j* + 1 is the parent of *j* for *j* = *s,... , s* + *t* − 2. Moreover, from Observation 4.2, for any *i, j* ∈ *S* with *i>j* , *i* ∈ *colL*{*j* } implies *colL*{*j* }\{1*,...,i*} ⊆ *colL*{*i*}*.* Therefore,

$$|col\_L\{s + i\}| \ge |col\_L\{s + i - 1\}| - 1, \quad i = 1, \ldots, t - 1,\tag{4.9}$$

with equality if and only if

$$col\_L\{s + i\} = col\_L\{s + i - 1\} \setminus \{s + i\},$$

that is, if *S* is a supernode.

Conversely, assume *S* is a maximal set of contiguous columns such that, for *i* = 1*,...,t* − 1, *s* + *i* − 1 is a child of *s* + *i* and *S* satisfies (4.8). Because of column replication, such a sequence of parent and child vertices must satisfy (4.9) with equality if and only if (4.7) is satisfied. It follows that *S* is a supernode.
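Theorem 4.13 suggests a simple way of partitioning the columns into supernodes once the elimination tree and the column counts are known. The sketch below is one possible reading of the theorem, not code from the book: it uses 0-based indices, assumes a postordered tree stored in a hypothetical `parent` array (−1 for a root), and takes `colcount[j]` to be |*col<sub>L</sub>*{*j*}|, the number of off-diagonal entries in column *j*. Because (4.9) bounds every single step, (4.8) holds over a chain exactly when each consecutive pair attains equality, which is what the loop tests.

```python
def supernodes_from_counts(parent, colcount):
    """Partition columns 0..n-1 into supernodes, as (first, last) pairs.

    Column j is merged with column j - 1 when j is the parent of j - 1
    and its column count is exactly one smaller.
    """
    n = len(parent)
    supernodes = []
    start = 0
    for j in range(1, n + 1):
        extends = (
            j < n
            and parent[j - 1] == j
            and colcount[j] == colcount[j - 1] - 1
        )
        if not extends:
            supernodes.append((start, j - 1))  # close the current supernode
            start = j
    return supernodes


# Tridiagonal 4 x 4 matrix with an additional full last row:
# col_L{0} = {1, 3}, col_L{1} = {2, 3}, col_L{2} = {3}, col_L{3} = {}.
print(supernodes_from_counts([1, 2, 3, -1], [2, 2, 1, 0]))
```

For this pattern column 0 cannot be merged with column 1 because the count does not drop, whereas columns 1 to 3 form a single supernode.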

Supernodes enhance the efficiency of sparse factorizations and sparse triangular solves because they enable floating-point operations to be performed on dense submatrices rather than on individual nonzeros, thus improving memory hierarchy utilization and allowing the use of highly efficient dense linear algebra kernels (such as Level 3 BLAS kernels). Because the rows and columns of a supernode have a common sparsity structure, this only needs to be stored once, reducing indirect addressing. Supernodes help to increase the granularity of tasks, which is useful for improving the computation to overhead ratio in a parallel implementation. Fill-in results in supernodes near the root of the assembly tree often being much larger than those close to the leaf vertices.

Observe that the columns within a supernode are numbered consecutively, but they can be numbered within the supernode in any order without changing the number of nonzeros in *L* (assuming the corresponding rows are permuted symmetrically). On some architectures, particularly those using GPUs, this freedom can be exploited to improve the factorization efficiency. Essentially, it is desirable to order the columns within a supernode such that the entries of *L* form fewer but less fragmented dense blocks.

Some applications, such as power grid analysis, in which the basis of the linear system is not a finite element or finite difference discretization of a physical domain, can lead to sparse matrices that incur very little fill-in during factorization. The supernodes can then be very small, and the costs associated with identifying them may not be offset by the increase in performance resulting from the potential for block operations. However, as supernodes can offer such significant performance gains, it can be advantageous to merge (small) supernodes that have similar (but not exactly the same) nonzero patterns, despite this increasing the overall fill-in and operation count. This process is termed **supernode amalgamation**, and the resultant nodes are often referred to as **relaxed supernodes**.

#### *4.6.1 Fundamental Supernodes*

In practice, fundamental supernodes are easier to work with in the numerical factorization. Let 1 ≤ *s,t* ≤ *n* with *s* + *t* − 1 ≤ *n*. A maximal set of contiguously numbered columns of *L* with indices *S* = {*s, s* + 1*,...,s* + *t* − 1} is a **fundamental supernode** if for any successive pair *i* − 1 and *i* in the list, *i* − 1 is the only child of *i* in T and *col<sub>L</sub>*{*i*} ∪ {*i*} = *col<sub>L</sub>*{*i* − 1}. Vertex *s* is termed the starting vertex of *S*. An example is given in Figure 4.11. The difference between the sets of supernodes and fundamental supernodes is normally not large, with the latter having (slightly) more blocks in the resulting partitioning of *L*. Note that fundamental supernodes are independent of the choice of the postordering of T. Theorem 4.14 describes the relationship between fundamental supernodes and the leaf vertices of row subtrees of T. In particular, it characterizes starting vertices of the fundamental supernodes. The leaf vertices of T are trivially starting vertices of fundamental supernodes. But, possibly surprisingly, so too are the leaf vertices of row subtrees.

**Theorem 4.14 (Liu et al. 1993)** *Assume* T *is postordered. Vertex s is the starting vertex of a fundamental supernode if and only if it has at least two child vertices in* T *or it is a leaf vertex of a row subtree of* T *.*

*Proof* If *s* has at least two child vertices then, from the definition of a fundamental supernode, it must be the starting vertex of a fundamental supernode. Assume that, for some *i* > *s*, *s* is a leaf vertex of T*r(i)*. If *s* is also a leaf vertex of T, then *s* is a starting vertex of a supernode. The remaining case is *s* having only one child. Because T is postordered, this child must be *s* − 1. Theorem 4.3 then implies *a<sub>is</sub>* ≠ 0 and *a*<sub>*i*,*s*−1</sub> = 0, that is, *i* ∈ *col<sub>L</sub>*{*s*} and *i* ∉ *col<sub>L</sub>*{*s* − 1}. It follows that

$$
\mathcal{S}\{L\_{s-1:n,s-1}\} \subsetneq \mathcal{S}\{L\_{s:n,s}\} \cup \{s-1\},
$$

and vertices *s* and *s* − 1 cannot belong to the same supernode. Hence, *s* is the starting vertex of a new fundamental supernode.

**Figure 4.11** A matrix *A* and its postordered elimination tree T for which the set of supernodes {1*,* 2} and {3*,* 4*,* 5*,* 6} and the set of fundamental supernodes {1*,* 2}*,* {3*,* 4} and {5*,* 6} are different. The filled entries in S{*L* + *L<sup>T</sup>*} are denoted by *f*.

Conversely, assume that *s* is the starting vertex of a fundamental supernode *S*. If *s* has no child vertices or at least two child vertices, the result follows. If *s* has exactly one child vertex, postordering implies this child is *s* − 1. Because *S* is maximal, there exists *i* such that *i* ∉ *col<sub>L</sub>*{*s* − 1} and *i* ∈ *col<sub>L</sub>*{*s*} (otherwise *S* could be extended by adding *s* − 1). Hence, *s* is a leaf vertex of T*r(i)*.

Because fundamental supernodes are characterized by their starting vertices, they can be found by modifying Algorithm 4.6 to incorporate marking leaf vertices of the row subtrees and vertices with at least two child vertices. Once the elimination tree has been computed, the complexity is *O(n*+*nz(A))*. The computation can be made even more efficient by using the skeleton graph G*(A*−*)*.
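The characterization of Theorem 4.14 can be turned into a short routine. The following Python sketch is illustrative only and rests on stated assumptions: indices are 0-based, `parent` is a postordered elimination tree with −1 for a root, and `isleaf` is the logical vector produced by a marking scheme such as Algorithm 4.6. Vertices without children are also treated as starting vertices, which covers leaf vertices of T that a row-based marking scheme may not flag (for example, an isolated vertex).

```python
def fundamental_supernodes(parent, isleaf):
    """Return the fundamental supernodes as (first, last) index pairs.

    A vertex starts a fundamental supernode if it is flagged as a leaf
    of some row subtree or if its number of children differs from one.
    """
    n = len(parent)
    nchild = [0] * n
    for p in parent:
        if p != -1:
            nchild[p] += 1
    starts = [s for s in range(n) if isleaf[s] or nchild[s] != 1]
    # Each supernode runs up to the vertex before the next starting vertex.
    ends = [s - 1 for s in starts[1:]] + [n - 1]
    return list(zip(starts, ends))


# Pattern with a_20 and a_21 nonzero (0-based): vertex 2 has two children,
# so it starts its own fundamental supernode.
print(fundamental_supernodes([2, 2, -1], [True, True, False]))
```

In the three-vertex example the supernodes are {0} and {1, 2}, whereas the fundamental supernodes are {0}, {1} and {2}, because vertex 2 has two children.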

#### **4.7 Notes and References**

The excellent monographs by Tewarson (1973), George & Liu (1981), and Davis (2006) represent milestones in the development of contemporary symbolic factorization algorithms and their implementation. A complementary way to follow many of the developments is by looking at the early software (and accompanying user documentation), such as YSMP (Eisenstat et al., 1982) and SPARSPAK (George & Ng, 1984). In addition, there are several influential survey articles focusing on sparse Cholesky algorithms and emphasizing the crucial role of the elimination tree, for example, Liu (1990), George (1998); see also Bollhöfer & Schenk (2006), Hogg & Scott (2013a) and the more recent comprehensive survey of Davis et al. (2016). The latter provides a general overview of much of the research related to sparse direct methods and includes pointers to many specialized references.

There are a large number of journal articles that provide a fuller understanding of the theory and algorithms employed in symbolic factorizations. Schreiber (1982) defines the elimination tree of a sparse symmetric matrix. The seminal paper of Liu (1986) describes elimination tree construction, while for an extensive overview of the roles of elimination trees and topological orderings as well as the determination of the column sparsity patterns of the factor *L*, we refer to Liu (1990). If only row and column counts of *L* are needed, the fastest known algorithms are described in Gilbert et al. (1994). This paper also refers to another admirable paper of Liu et al. (1993) that describes the efficient computation of fundamental supernodes based on the leaf vertices of row subtrees of the elimination tree.

A key driver behind research into efficient (in terms of time and memory) sparse Cholesky algorithms has always been the development of computational codes. Many currently available packages implement not only sparse Cholesky factorizations but also more general *LDL<sup>T</sup>* factorizations of sparse symmetric indefinite matrices. The software is necessarily highly sophisticated and is therefore generally accompanied by technical reports and/or journal publications that explain the data structures and choices that were made in the algorithm and software design as well as providing details of the different options that are offered (examples include Duff (2004), Reid & Scott (2009), Hogg et al. (2010)).

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 5 Sparse Cholesky Solver: The Factorization Phase**

*The adoption of Cholesky's method owes not a little to the publicity given to it shortly after the end of World War II by British mathematicians and computer pioneers, including Alan Turing, Leslie Fox, Jim Wilkinson, and especially John Todd – Benzi (2017).*

*Achieving high performance for sparse direct solvers in general, and sparse Cholesky factorization, in particular, is a very well researched topic – Rennich et al. (2016)*

Having considered the symbolic phase of a sparse Cholesky solver in the previous chapter, the focus of this chapter is the subsequent numerical factorization phase. If *A* is a symmetric positive definite (SPD) matrix, then it is factorizable (strongly regular) and (in exact arithmetic) its Cholesky factorization *A* = *LL<sup>T</sup>* exists. *LDL<sup>T</sup>* factorizations of general symmetric indefinite matrices are considered in Chapter 7.

### **5.1 Dense Cholesky Factorizations**

Because efficient implementations of sparse Cholesky factorizations rely heavily on exploiting dense blocks, we first consider algorithms for the Cholesky factorization of dense matrices that can be applied to such blocks. Algorithm 5.1 is a basic left-looking algorithm. It is an in-place algorithm because *L* can overwrite the lower triangular part of *A* (thus reducing memory requirements if *A* is no longer required).

Writing *A* in the block form (1.2), the computation can be reorganized to give Algorithm 5.2. This allows the exploitation of Level 3 BLAS for the computationally intensive components (dense matrix-matrix multiplies and dense triangular solves). Here *A* has *nb* block columns, which are referred to as **panels**. Step 6 can be performed using Algorithm 5.1.
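For concreteness, the column-by-column scheme of Algorithm 5.1 can be sketched in pure Python as below. This is a minimal illustration under stated assumptions rather than a tuned kernel: it uses 0-based indices, stores matrices as lists of rows, and, unlike the in-place description, writes the factor to a separate array so that the input is left untouched. The function name `cholesky_left_looking` and the positivity check are additions made for the example.

```python
import math


def cholesky_left_looking(A):
    """Return the lower triangular L with A = L L^T.

    Only the lower triangular part of the SPD matrix A is read.
    """
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j, n):  # copy column j of the lower triangle
            L[i][j] = A[i][j]
        for k in range(j):  # apply the updates from all previous columns
            ljk = L[j][k]
            for i in range(j, n):
                L[i][j] -= L[i][k] * ljk
        if L[j][j] <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(L[j][j])  # diagonal entry
        for i in range(j + 1, n):  # scale the off-diagonal entries
            L[i][j] /= L[j][j]
    return L


A = [[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]]
for row in cholesky_left_looking(A):
    print(row)
```

A practical implementation would instead call dense kernels on blocks, which is the motivation for the panel and block variants discussed in this section.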

Algorithms 5.1 and 5.2 are left-looking. This means that the updates are not applied immediately. Instead, all updates from previous (block) columns are applied together to the current (block) column before it is factorized. In a right-looking approach (Algorithm 5.3), outer product updates are applied to the part of the matrix that has not yet been factored as they are generated.

#### **ALGORITHM 5.1 In-place dense left-looking Cholesky factorization**

**Input:** Dense SPD matrix *A*.

**Output:** Factor *L* such that *A* = *LL<sup>T</sup>*.

1: **for** *j* = 1 : *n* **do**
2:   *L*<sub>j:n,j</sub> = *A*<sub>j:n,j</sub> ▷ Only the lower triangular part of *A* is required
3:   **for** *k* = 1 : *j* − 1 **do**
4:     *L*<sub>j:n,j</sub> = *L*<sub>j:n,j</sub> − *L*<sub>j:n,k</sub> *l<sub>jk</sub>* ▷ Update column *j* using previous columns
5:   **end for**
6:   *l<sub>jj</sub>* = (*l<sub>jj</sub>*)<sup>1/2</sup> ▷ Overwrite the diagonal entry with its square root
7:   *L*<sub>j+1:n,j</sub> = *L*<sub>j+1:n,j</sub> / *l<sub>jj</sub>* ▷ Scale off-diagonal entries in column *j*
8: **end for**

#### **ALGORITHM 5.2 In-place dense left-looking panel Cholesky factorization**

**Input:** Dense SPD matrix *A* in the form (1.2) with *nb* panels.

**Output:** Factor *L* such that *A* = *LL<sup>T</sup>*.

The large panel updates can be split into operations involving only blocks. This is shown in Algorithm 5.4 for the right-looking approach.

The panel and block descriptions of the factorization enable efficient parallelization. The three main block operations, which are called tasks, are **factorize**(*jb*), **solve**(*ib*, *jb*), and **update**(*ib*, *jb*, *kb*). There are the following dependencies between the tasks.

**factorize**(*jb*) depends on **update**(*jb*, *kb*, *jb*) for all *kb* = 1, ..., *jb* − 1.

**solve**(*ib*, *jb*) depends on **update**(*ib*, *kb*, *jb*) for all *kb* = 1, ..., *jb* − 1, and on **factorize**(*jb*).

**update**(*ib*, *jb*, *kb*) depends on **solve**(*ib*, *jb*) and **solve**(*kb*, *jb*).

**ALGORITHM 5.3 In-place dense right-looking panel Cholesky factorization**

**Input:** Dense SPD matrix *A* in the form (1.2) with *nb* panels.

**Output:** Factor *L* such that *A* = *LL<sup>T</sup>*.

1: **for** *jb* = 1 : *nb* **do**
2:   *L*<sub>jb:nb,jb</sub> = *A*<sub>jb:nb,jb</sub>
3: **end for**
4: **for** *jb* = 1 : *nb* **do**
5:   Compute in-place factorization of *L*<sub>jb,jb</sub> ▷ Overwrite *L*<sub>jb,jb</sub> with its Cholesky factor
6:   *L*<sub>jb+1:nb,jb</sub> = *L*<sub>jb+1:nb,jb</sub> *L*<sub>jb,jb</sub><sup>−*T*</sup> ▷ Dense triangular solve
7:   **for** *kb* = *jb* + 1 : *nb* **do**
8:     *L*<sub>kb:nb,kb</sub> = *L*<sub>kb:nb,kb</sub> − *L*<sub>kb:nb,jb</sub> *L*<sub>kb,jb</sub><sup>*T*</sup>
9:   **end for**
10: **end for**

### **ALGORITHM 5.4 In-place dense right-looking block Cholesky factorization**

**Input:** Dense SPD matrix *A* in the form (1.2) with *nb* × *nb* blocks.

**Output:** Factor *L* such that *A* = *LL<sup>T</sup>*.

1: **for** *jb* = 1 : *nb* **do**
2:   *L*<sub>jb:nb,jb</sub> = *A*<sub>jb:nb,jb</sub>
3: **end for**
4: **for** *jb* = 1 : *nb* **do**
5:   Compute in-place factorization of *L*<sub>jb,jb</sub> ▷ Task **factorize**(*jb*)
6:   **for** *ib* = *jb* + 1 : *nb* **do**
7:     *L*<sub>ib,jb</sub> = *L*<sub>ib,jb</sub> *L*<sub>jb,jb</sub><sup>−*T*</sup> ▷ Task **solve**(*ib*, *jb*)
8:     **for** *kb* = *jb* + 1 : *ib* **do**
9:       *L*<sub>ib,kb</sub> = *L*<sub>ib,kb</sub> − *L*<sub>ib,jb</sub> *L*<sub>kb,jb</sub><sup>*T*</sup> ▷ Task **update**(*ib*, *jb*, *kb*)
10:     **end for**
11:   **end for**
12: **end for**

A dependency graph can be used to schedule the tasks. Its vertices correspond to tasks and dependencies between tasks are represented as directed edges. The result is a directed acyclic graph (DAG). A task is ready for execution if and only if all tasks with incoming edges to it have completed. DAG-driven linear algebra uses either a static or dynamic schedule based on these graphs to implement the tasks in a parallel environment. In practice, it is not necessary to explicitly compute the task DAG: it can be constructed on-the-fly taking into account the dependencies. The task DAG allows a lot of flexibility in the order in which tasks are carried out: the left- and right-looking approaches correspond to particular restricted orderings of the tasks.
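The following Python sketch illustrates how the task dependencies constrain the execution order. It is an illustration under stated assumptions, not a scheduler: tasks are labelled as in Algorithm 5.4, so that **update**(*ib*, *jb*, *kb*) modifies block (*ib*, *kb*) using block column *jb* and therefore needs blocks (*ib*, *jb*) and (*kb*, *jb*) to be final. The function names are hypothetical. The check confirms that both a right-looking and a left-looking sequence of tasks respect the dependencies.

```python
def right_looking_order(nb):
    """Tasks in the order generated by the loops of Algorithm 5.4."""
    order = []
    for jb in range(1, nb + 1):
        order.append(("factorize", jb))
        for ib in range(jb + 1, nb + 1):
            order.append(("solve", ib, jb))
            for kb in range(jb + 1, ib + 1):
                order.append(("update", ib, jb, kb))
    return order


def left_looking_order(nb):
    """Same tasks, but all updates to a block column precede its factorization."""
    order = []
    for cb in range(1, nb + 1):
        for jb in range(1, cb):
            for ib in range(cb, nb + 1):
                order.append(("update", ib, jb, cb))
        order.append(("factorize", cb))
        for ib in range(cb + 1, nb + 1):
            order.append(("solve", ib, cb))
    return order


def dependencies(task):
    """Tasks that must complete before the given task may start."""
    if task[0] == "factorize":
        jb = task[1]
        return [("update", jb, kb, jb) for kb in range(1, jb)]
    if task[0] == "solve":
        _, ib, jb = task
        deps = [("update", ib, kb, jb) for kb in range(1, jb)]
        return deps + [("factorize", jb)]
    _, ib, jb, kb = task
    return [("solve", ib, jb), ("solve", kb, jb)]


def respects_dependencies(order):
    """True if every task appears after all the tasks it depends on."""
    done = set()
    for task in order:
        if any(dep not in done for dep in dependencies(task)):
            return False
        done.add(task)
    return True


print(respects_dependencies(right_looking_order(4)))
print(respects_dependencies(left_looking_order(4)))
```

Any other ordering that passes the same check is equally valid, which is the flexibility a DAG-driven scheduler exploits.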

#### **5.2 Introduction to Sparse Cholesky Factorizations**

There are several classes of algorithms that implement sparse Cholesky factorizations. Their major differences relate to how they schedule the computations. This affects the use of dense kernels, the amount of memory required during the factorization as well as the potential for parallel implementations. As in the dense case, the factorization is split into tasks that involve computations on and between dense submatrices and the precedence relations among them can be captured by a task graph.

We start by extending the dense Cholesky factorizations to the sparse case in a straightforward way. In practice, it is essential for efficiency to exploit the supervariables of *A* and the supernodes of *L*. Thus, while for simplicity of the descriptions and notation, we refer to rows and columns of *A* and *L*, these typically represent block rows and block columns and, as in the above discussion of the dense block factorization algorithm, the entries of *A* and *L* are then submatrices.

The entries of *L* satisfy the relationship

$$L\_{j+1:n,j} = \left( A\_{j+1:n,j} - \sum\_{k=1}^{j-1} L\_{j+1:n,k} l\_{jk} \right) / l\_{jj} \quad \text{with} \quad l\_{jj} = \left( a\_{jj} - \sum\_{k=1}^{j-1} l\_{jk}^2 \right)^{1/2},$$

and from this we deduce the following result.

**Theorem 5.1 (Liu 1990)** *The numerical values of the entries in column j > k of L depend on the numerical values in column k of L if and only if l<sub>jk</sub>* ≠ 0*.*

The theoretical background of the previous chapter based on the elimination tree T enables the dependencies in Theorem 5.1 to be searched for efficiently. In particular, T allows the row (or column) counts of *L* to be computed and they can be used to allocate storage for *L*. It can also be used to find supernodes and the resulting (block) elimination tree can then be employed to determine the (block) column structure of *L*. In practice, it can be beneficial to split large supernodes into smaller panels to better conform to computer caches.

Algorithms 5.5 and 5.6 are simplified sparse left- and right-looking Cholesky factorization algorithms that are straightforward sparse variants of Algorithms 5.1 and 5.4, respectively (the latter with *nb* = *n*, that is, without considering blocks). Here, we assume that the sparsity pattern of *L* has already been determined in the symbolic phase and static storage formats based, for example, on compressed columns and/or rows are used.

## **ALGORITHM 5.5 Simplified sparse left-looking Cholesky factorization**

**Input:** SPD matrix *A* and sparsity pattern S{*L*}.

**Output:** Factor *L* such that *A* = *LL<sup>T</sup>*.

1: *l<sub>ij</sub>* = *a<sub>ij</sub>* for all (*i*, *j*) ∈ S{*L*} ▷ Filled entries in *L* are initialised to zero
2: **for** *j* = 1 : *n* **do**
3:   **for** *k* ∈ {*k* < *j* | *l<sub>jk</sub>* ≠ 0} **do**
4:     **for** *i* ∈ {*i* ≥ *j* | *l<sub>ik</sub>* ≠ 0} **do**
5:       *l<sub>ij</sub>* = *l<sub>ij</sub>* − *l<sub>ik</sub> l<sub>jk</sub>*
6:     **end for**
7:   **end for**
8:   *l<sub>jj</sub>* = (*l<sub>jj</sub>*)<sup>1/2</sup>
9:   **for** *i* ∈ {*i* > *j* | *l<sub>ij</sub>* ≠ 0} **do**
10:     *l<sub>ij</sub>* = *l<sub>ij</sub>* / *l<sub>jj</sub>*
11:   **end for**
12: **end for**

#### **ALGORITHM 5.6 Simplified sparse right-looking Cholesky factorization**

**Input:** SPD matrix *A* and sparsity pattern S{*L*}.

**Output:** Factor *L* such that *A* = *LL<sup>T</sup>*.

1: *l<sub>ij</sub>* = *a<sub>ij</sub>* for all (*i*, *j*) ∈ S{*L*} ▷ Filled entries in *L* are initialised to zero
2: **for** *j* = 1 : *n* **do**
3:   *l<sub>jj</sub>* = (*l<sub>jj</sub>*)<sup>1/2</sup>
4:   **for** *i* ∈ {*i* > *j* | *l<sub>ij</sub>* ≠ 0} **do**
5:     *l<sub>ij</sub>* = *l<sub>ij</sub>* / *l<sub>jj</sub>*
6:   **end for**
7:   **for** *k* ∈ {*k* > *j* | *l<sub>kj</sub>* ≠ 0} **do**
8:     **for** *i* ∈ {*i* ≥ *k* | *l<sub>ij</sub>* ≠ 0} **do**
9:       *l<sub>ik</sub>* = *l<sub>ik</sub>* − *l<sub>ij</sub> l<sub>kj</sub>*
10:     **end for**
11:   **end for**
12: **end for**
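A simplified sparse left-looking factorization in the spirit of Algorithm 5.5 might be sketched in Python as follows. The data layout is an assumption chosen for brevity, not a recommended storage format: `A` is a dictionary keyed by (row, column) for the lower triangle, `pattern[j]` lists the row indices of S{*L*} in column *j* (including the diagonal), and *L* is returned as one dictionary per column. Indices are 0-based. The pattern is assumed to come from a correct symbolic phase; if it omits a filled entry, the update raises a `KeyError`.

```python
import math


def sparse_left_looking_cholesky(A, pattern):
    """Compute L column by column, touching only entries in S{L}."""
    n = len(pattern)
    # Copy A into the pattern of L; filled entries start at zero.
    L = [{i: A.get((i, j), 0.0) for i in pattern[j]} for j in range(n)]
    # rows[j] lists the columns k < j with l_jk in the pattern.
    rows = [[] for _ in range(n)]
    for k in range(n):
        for i in pattern[k]:
            if i > k:
                rows[i].append(k)
    for j in range(n):
        for k in rows[j]:  # columns that update column j
            ljk = L[k][j]
            for i, lik in L[k].items():
                if i >= j:
                    L[j][i] -= lik * ljk
        L[j][j] = math.sqrt(L[j][j])
        for i in pattern[j]:
            if i > j:
                L[j][i] /= L[j][j]
    return L


# a_21 = 0 in A (0-based), but l_21 is a filled entry of L.
A = {(0, 0): 4.0, (1, 0): 2.0, (2, 0): 2.0, (1, 1): 5.0, (2, 2): 5.25}
pattern = [[0, 1, 2], [1, 2], [2]]
print(sparse_left_looking_cholesky(A, pattern))
```

The auxiliary `rows` lists give access to the row structure of *L* that the left-looking variant needs at Step 3; production codes obtain the same information from the elimination tree or from linked lists.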

An alternative for sparse matrices held in row-wise format is to compute *L* one row at a time. This is sometimes called an **up-looking** factorization because rows 1 to *i* − 1 are employed to compute row *i* (*i >* 1). The approach is asymptotically optimal in the work performed and for highly sparse matrices it is potentially extremely efficient because the entries of *A* are used in the natural order in which they are stored. However, it is difficult to incorporate high level BLAS.

The following relation holds for the *i*-th row of *L*

$$L\_{i,1:i-1}^T = L\_{1:i-1,1:i-1}^{-1} A\_{1:i-1,i} \quad \text{with} \quad l\_{ii}^2 = a\_{ii} - L\_{i,1:i-1} L\_{i,1:i-1}^T.$$

The application of *L*<sub>1:i−1,1:i−1</sub><sup>−1</sup> can be implemented by solving the triangular system

$$L\_{1:i-1,1:i-1} \, y = A\_{1:i-1,i}$$

and setting *L*<sub>i,1:i−1</sub><sup>*T*</sup> = *y*. The following result can be used to determine the sparsity pattern of *y*.

**Theorem 5.2 (Gilbert 1994)** *Consider a sparse lower triangular matrix L and the DAG* G(*L<sup>T</sup>*) *with vertex set* {1*,* 2*,...,n*} *and edge set* {(*j* → *i*) | *l<sub>ij</sub>* ≠ 0}*. The sparsity pattern* S{*y*} *of the solution y of the system Ly* = *b is the set of all vertices reachable in* G(*L<sup>T</sup>*) *from* S{*b*}*.*

*Proof* From Algorithm 3.4 and assuming the non-cancellation assumption, we see that (a) if *b<sub>i</sub>* ≠ 0, then *y<sub>i</sub>* ≠ 0 and (b) if for some *j* < *i*, *y<sub>j</sub>* ≠ 0 and *l<sub>ij</sub>* ≠ 0, then *y<sub>i</sub>* ≠ 0. These two conditions can be expressed as a graph traversal problem in G(*L<sup>T</sup>*). (a) adds all vertices in S{*b*} to the set of visited vertices and (b) states that if vertex *j* has been visited, then all its neighbours in G(*L<sup>T</sup>*) are added to the set of visited vertices. Thus S{*y*} = R*each*(S{*b*}) ∪ S{*b*}.

Figure 5.1 illustrates the sparsity patterns of a lower triangular matrix *L* and vector *b* together with G(*L<sup>T</sup>*). The vertices that are reachable from S{*b*} = {2*,* 4} are 5 and 6 and thus S{*y*} = {2*,* 4*,* 5*,* 6}.
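The reachability computation of Theorem 5.2 can be sketched with an explicit stack. The example below is a hedged illustration: it uses 0-based indices, and the input `L_cols` (for each column *j*, the row indices *i* > *j* with *l<sub>ij</sub>* ≠ 0) describes a hypothetical pattern, not the matrix of Figure 5.1. The result is returned in increasing index order, which is one valid topological order for a lower triangular *L* but costs a sort; a depth-first search that records vertices as they are finished avoids that cost.

```python
def solution_pattern(L_cols, b_pattern):
    """Return S{y} for L y = b as a sorted list of indices.

    L_cols[j] holds the row indices i > j with l_ij != 0, that is,
    the targets of the edges j -> i in the DAG G(L^T).
    """
    visited = set()
    stack = list(b_pattern)  # every vertex of S{b} is a starting point
    while stack:
        j = stack.pop()
        if j in visited:
            continue
        visited.add(j)
        stack.extend(i for i in L_cols[j] if i not in visited)
    return sorted(visited)


# Hypothetical 6 x 6 pattern with edges 0->2, 1->4, 2->3, 3->5, 4->5.
L_cols = [[2], [4], [3], [5], [5], []]
print(solution_pattern(L_cols, [1, 3]))
```

Starting from vertices 1 and 3, the traversal reaches 4 and 5, so only four of the six components of *y* can be nonzero and the solve can skip columns 0 and 2 entirely.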

Algorithm 5.7 outlines a sparse row Cholesky factorization that is based on the repeated solution of triangular linear systems. Theorem 5.2 can be used to determine the sparsity pattern of row *i* at Step 3, that is, by finding all the vertices that are reachable in G(*L*<sub>1:j−1,1:j−1</sub><sup>*T*</sup>) from the set {*i* | *a<sub>ij</sub>* ≠ 0, *i* < *j*}. A depth-first search of G(*L*<sub>1:j−1,1:j−1</sub><sup>*T*</sup>) determines the vertices in the row sparsity patterns in topological order, and performing the numerical solves in that order correctly preserves the numerical dependencies. Alternatively, because nonzeros of *L*<sub>i,1:i−1</sub> correspond to the vertices in the *i*-th row subtree T*r(i)* that are not equal to *i*, another option is to find the row subtrees using T*(A)*.

**Figure 5.1** An example to illustrate *L*, *b* and G(*L<sup>T</sup>*).
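The row-by-row scheme can be sketched as follows. This is a simplified illustration under stated assumptions and should not be read as the book's Algorithm 5.7: rows are stored as dictionaries mapping column index to value (0-based), and instead of a reachability search the sketch simply visits every column *j* < *i* in increasing order, which preserves the numerical dependencies but does not achieve the optimal work bound. An entry is stored only if its computed value is nonzero.

```python
import math


def up_looking_cholesky(A_rows):
    """Compute L one row at a time from the lower triangle of A.

    A_rows[i] maps column indices j <= i to a_ij; the result uses the
    same layout for L.
    """
    L_rows = []
    for i in range(len(A_rows)):
        y = {}  # off-diagonal part of row i of L
        for j in range(i):  # forward substitution with L[0:i, 0:i]
            s = A_rows[i].get(j, 0.0)
            for k, ljk in L_rows[j].items():
                if k < j and k in y:
                    s -= ljk * y[k]
            if s != 0.0:
                y[j] = s / L_rows[j][j]
        # Diagonal entry: l_ii^2 = a_ii - sum of squares of row i.
        y[i] = math.sqrt(A_rows[i][i] - sum(v * v for v in y.values()))
        L_rows.append(y)
    return L_rows


A_rows = [{0: 4.0}, {0: 2.0, 1: 5.0}, {0: 2.0, 2: 5.25}]
print(up_looking_cholesky(A_rows))
```

Replacing the loop over all *j* < *i* by a traversal of the reachable set, or of the row subtree, restricts the work to the entries that are actually nonzero.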

#### **5.3 Supernodal Sparse Cholesky Factorizations**

The simplified schemes form the basis of sophisticated supernodal algorithms that are designed to be efficient in parallel computational environments. Consider the right-looking variant and recall that a supernode consists of one or more consecutive columns of *L* with the same sparsity pattern. These nonzeros are stored as a dense trapezoidal matrix (only the lower triangular part of the block on the diagonal needs to be stored and the rows of zeros in the columns of the supernode are not held). This is termed a **nodal matrix** (see Figure 5.2).

Once a supernode is ready to be factorized, a dense Cholesky factorization of the block on the diagonal of the nodal matrix is performed (one of the approaches of Section 5.1 can be used). Then a triangular solve is performed with the computed factor and the rectangular part of the nodal matrix. The next step is to iterate over ancestors of the supernode in the assembly tree. For each such ancestor, the rows of the current supernode corresponding to the ancestor's columns are identified, and then the outer product of those rows and the part of the supernode below those columns is formed (update operations). The resulting matrix can be held in a temporary buffer. The rows and columns of this buffer are matched against indices of the ancestors and are added to them in a sparse scatter operation. For efficiency, the updates may use panels so that the temporary buffer remains in cache.

**Figure 5.2** An illustration of a supernode (left), the corresponding nodal matrix (centre), and the nodal matrix with two panels (right). The shaded lower triangular part of the block on the diagonal and the shaded block rows are treated as dense.

#### *5.3.1 DAG-Based Approach*

The DAG-based approach can also be extended to the sparse case. Each nodal matrix is subdivided into blocks. The factorization is split into tasks in which a single block is revised. The key difference compared to the dense case is that it is necessary to distinguish between two types of update operations: **update\_internal** performs the update between blocks in the same nodal matrix and **update\_between** performs the update when the blocks belong to different nodal matrices. Thus the sparse Cholesky factorization is split into the following tasks; the first two are illustrated in Figure 5.3. In this example, the nodal matrix has two block columns that do not contain the same number of columns.


Again, the tasks are partially ordered and a task DAG is used to capture the dependencies. For example, the updating of a block of a nodal matrix from a block column of *L* that is associated with a descendant of the supernode has to wait until all the relevant rows of the block column are available. At each stage of the factorization, tasks will be executing (in parallel) while others are held (in a stack or pool of tasks) ready for execution.

**Figure 5.3** An illustration of a blocked nodal matrix with two block columns. The first block on the diagonal is triangular and the second one is trapezoidal. The task **factorize\_block** is illustrated on the left and in the centre; the task **solve\_block** is illustrated on the right.

#### **5.4 Multifrontal Method**

The **multifrontal** method is an alternative way to compute a sparse Cholesky factorization. To discuss this popular approach, we use the following result that determines which rows and columns influence particular Schur complements using the terminology of the elimination tree.

**Theorem 5.3 (Liu 1990)** *Let A be SPD and let* T *be its elimination tree. The numerical values of entries in column k of the Cholesky factor L of A only affect the numerical values of entries in column i of L for i* ∈ *anc*<sup>T</sup>{*k*} *(*1 ≤ *k* < *i* ≤ *n* − 1*).*

*Proof* From (4.1), setting *S*<sup>(1)</sup> = *A*, for *k* ≥ 2 the (*n* − *k* + 1) × (*n* − *k* + 1) Schur complement *S*<sup>(*k*)</sup> can be expressed as

$$S^{(k)} = S\_{k:n,k:n}^{(k-1)} - \begin{pmatrix} l\_{k,k-1} \\ \vdots \\ l\_{n,k-1} \end{pmatrix} \begin{pmatrix} l\_{k,k-1} & \dots & l\_{n,k-1} \end{pmatrix} = S\_{k:n,k:n}^{(k-1)} - L\_{k:n,k-1} L\_{k:n,k-1}^T. \tag{5.1}$$

Theorem 4.2 implies that all nonzero off-diagonal entries *lik* in column *k* of *L* explicitly used in the update (5.1) are such that *i* ∈ *anc*<sup>T</sup> {*k*}. Considering the Cholesky factorization as a sequence of Schur complement updates, only columns *i* with *i* ∈ *anc*<sup>T</sup> {*k*} can be influenced numerically by the Schur complement update in the *k*-th step of the factorization, and the result follows.

The computation of subsequent Schur complements by adding individual updates as in (5.1) is straightforward; the multifrontal method employs further modifications and enhancements of this basic concept. First, because the vertices of T are topologically ordered, the order in which the updates are applied progresses up the tree from the leaf vertices to the root vertex. This allows the computation of *S*<sup>(*k*)</sup> to be rewritten as

$$S^{(k)} = A\_{k:n,k:n} - \sum\_{j \in \mathcal{T}(k) \backslash \{k\}} L\_{k:n,j} L\_{k:n,j}^T,$$

emphasizing the role of T. In place of Schur complements, the multifrontal method uses frontal matrices connected to subtrees of T. Assume *k*, *k*<sub>1</sub>, ..., *k<sub>r</sub>* are the row indices of the nonzeros in column *k* of *L*. The **frontal matrix** *F<sub>k</sub>* of the *k*-th subtree T(*k*) of T is the dense (*r* + 1) × (*r* + 1) matrix defined by

$$F\_k = \begin{pmatrix} a\_{kk} & a\_{kk\_1} & \dots & a\_{kk\_r} \\ a\_{k\_1k} & 0 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a\_{k\_rk} & 0 & \dots & 0 \end{pmatrix} - \sum\_{j \in \mathcal{T}(k) \backslash \{k\}} \begin{pmatrix} l\_{kj} \\ l\_{k\_1j} \\ \vdots \\ l\_{k\_rj} \end{pmatrix} \begin{pmatrix} l\_{kj} & l\_{k\_1j} & \dots & l\_{k\_rj} \end{pmatrix} . \tag{5.2}$$

One step of the Cholesky factorization of *F<sub>k</sub>* can be written as

$$F\_k = \begin{pmatrix} l\_{kk} & 0 & \dots & 0 \\ l\_{k\_1k} & & & \\ \vdots & & I & \\ l\_{k\_rk} & & & \end{pmatrix} \begin{pmatrix} 1 & \\ & V\_k \end{pmatrix} \begin{pmatrix} l\_{kk} & l\_{k\_1k} & \dots & l\_{k\_rk} \\ 0 & & & \\ \vdots & & I & \\ 0 & & & \end{pmatrix} \tag{5.3}$$

$$= \begin{pmatrix} l\_{kk} \\ l\_{k\_1k} \\ \vdots \\ l\_{k\_rk} \end{pmatrix} \begin{pmatrix} l\_{kk} & l\_{k\_1k} & \dots & l\_{k\_rk} \end{pmatrix} + \begin{pmatrix} 0 & \\ & V\_k \end{pmatrix},\tag{5.4}$$

where *V<sub>k</sub>* is termed a **generated element** (it is also sometimes called an **update matrix** or a **contribution block**). The name "generated element" arises because the multifrontal method has its origins in the simpler **frontal method**, which uses a single frontal matrix. The frontal method was originally proposed for finite element problems to avoid the need to explicitly construct the system matrix *A*; it was later generalized to non-element problems. Because it works with a single frontal matrix, it has less scope for parallelization than the multifrontal method, and it is no longer widely used.

Equating the last *r* rows and columns in (5.2) and (5.4) yields

$$V\_k = -\sum\_{j \in \mathcal{T}(k)} \begin{pmatrix} l\_{k\_1 j} \\ \vdots \\ l\_{k\_r j} \end{pmatrix} \begin{pmatrix} l\_{k\_1 j} & \dots & l\_{k\_r j} \end{pmatrix} . \tag{5.5}$$

Assume that *c<sub>j</sub>* (*j* = 1, ..., *s*) are the children of *k* in T. The set T(*k*) \ {*k*} is the union of the disjoint sets of vertices in the subtrees T(*c<sub>j</sub>*). Each of these subtrees is represented in the overall update by its generated element (5.5). Thus, *F<sub>k</sub>* can be written in a recursive form using the generated elements of the children of *k* as follows

$$F\_k = \begin{pmatrix} a\_{kk} & a\_{kk\_1} & \dots & a\_{kk\_r} \\ a\_{k\_1k} & 0 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a\_{k\_rk} & 0 & \dots & 0 \end{pmatrix} \oplus V\_{c\_1} \oplus \dots \oplus V\_{c\_s}.\tag{5.6}$$

Here, the operation ⊕ denotes the addition of matrices whose row and column indices belong to subsets of the same set of indices (in this case, *k*, *k*<sub>1</sub>, ..., *k<sub>r</sub>*); entries that have the same row and column indices are summed. This is referred to as the **extend-add operator**.
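As an illustration, the extend-add operation can be sketched in a few lines of Python. This is a minimal dense sketch, not how a production solver stores its data: the function name `extend_add` and the representation of each matrix as a list of global indices together with a list of rows are assumptions made for this example only.

```python
def extend_add(idx_a, A, idx_b, B):
    """Extend-add two small dense matrices that carry global index lists.

    idx_a, idx_b: global row/column indices of A and B (hypothetical layout).
    Returns the merged index list and the summed matrix on that index set.
    """
    idx = sorted(set(idx_a) | set(idx_b))
    pos = {g: p for p, g in enumerate(idx)}      # global index -> local position
    C = [[0.0] * len(idx) for _ in idx]
    for glob, M in ((idx_a, A), (idx_b, B)):
        for r, gi in enumerate(glob):
            for c, gj in enumerate(glob):
                # entries with the same global row and column index are summed
                C[pos[gi]][pos[gj]] += M[r][c]
    return idx, C


# Two matrices sharing only the global index 8.
idx, C = extend_add([4, 8], [[1.0, 2.0], [2.0, 3.0]],
                    [7, 8], [[10.0, 20.0], [20.0, 30.0]])
print(idx)  # [4, 7, 8]
print(C)    # [[1.0, 0.0, 2.0], [0.0, 10.0, 20.0], [2.0, 20.0, 33.0]]
```

Only the entry with global indices (8, 8) receives a contribution from both operands; all other entries are copied from one operand or remain zero.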

Adding a row and column of *A* and the generated elements into a frontal matrix is called the **assembly**. A variable is **fully summed** if it is not involved in any rows and columns of *A* that have still to be assembled or in a generated element. Once a variable is fully summed, it can be eliminated. A key feature of the multifrontal method is that the frontal matrices and the generated elements are compressed and stored without zero rows and columns as small dense matrices. Integer arrays are used to maintain a mapping of the local contiguous indices of the frontal matrices to the global indices of *A* and its factors. Symmetry allows only the lower triangular part of these matrices to be held. Algorithm 5.8 outlines the basic multifrontal method.
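The assembly and elimination steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm as implemented in any particular solver: the matrix is held as a dense list of rows with 0-based indices, the function name `multifrontal_cholesky` is hypothetical, and the parent of each vertex is discovered on the fly as the first off-diagonal index of the frontal matrix rather than being read from a precomputed elimination tree.

```python
import math


def multifrontal_cholesky(A):
    """Sketch of a multifrontal Cholesky factorization of a dense-stored SPD matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    waiting = [[] for _ in range(n)]   # generated elements waiting at their parent
    for k in range(n):
        # Index list of F_k: k, the rows of nonzeros below the diagonal in
        # column k of A, and the indices carried by the children's elements.
        below = {i for i in range(k + 1, n) if A[i][k] != 0.0}
        for child_idx, _ in waiting[k]:
            below.update(i for i in child_idx if i > k)
        idx = [k] + sorted(below)
        pos = {g: p for p, g in enumerate(idx)}
        m = len(idx)
        F = [[0.0] * m for _ in range(m)]
        # Assembly: row and column k of A ...
        for g in idx:
            F[pos[g]][0] = A[g][k]
            F[0][pos[g]] = A[k][g]
        # ... extend-added with the generated elements of the children.
        for child_idx, V in waiting[k]:
            for r, gi in enumerate(child_idx):
                for c, gj in enumerate(child_idx):
                    F[pos[gi]][pos[gj]] += V[r][c]
        waiting[k] = []                # each generated element is used only once
        # Variable k is now fully summed and can be eliminated.
        pivot = math.sqrt(F[0][0])
        col = [pivot] + [F[p][0] / pivot for p in range(1, m)]
        for p, g in enumerate(idx):
            L[g][k] = col[p]
        if m > 1:
            # Generated element: trailing block minus the rank-one update.
            V = [[F[p][q] - col[p] * col[q] for q in range(1, m)]
                 for p in range(1, m)]
            waiting[idx[1]].append((idx[1:], V))   # idx[1] is the parent of k
    return L


A = [[4.0, 0.0, 0.0, 2.0],
     [0.0, 9.0, 0.0, 3.0],
     [0.0, 0.0, 16.0, 4.0],
     [2.0, 3.0, 4.0, 28.0]]
for row in multifrontal_cholesky(A):
    print(row)
```

In this arrow-shaped example the first three vertices are leaves whose 1 × 1 generated elements are all assembled into the frontal matrix of the last vertex. A practical code would store only the lower triangles and would take the index lists from a symbolic factorization.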

#### **ALGORITHM 5.8 Basic multifrontal Cholesky factorization**

**Input:** SPD matrix *A* and its elimination tree. **Output:** Factor *L* such that *A* = *LL<sup>T</sup>*.



#### **ALGORITHM 5.9 Multifrontal Cholesky factorization using the assembly tree**

**Input:** SPD matrix *A* and its assembly tree. **Output:** Factor *L* such that *A* = *LL<sup>T</sup>*.


We have the following observation.

**Observation 5.1** *Each generated element Vk is used only once to contribute to a frontal matrix Fparent (k). Furthermore, the index list for the frontal matrix Fk is the set of row indices of the nonzeros in column k of the Cholesky factor L.*

In practical implementations, efficiency is improved by using the assembly tree (Section 4.6) because it allows more than one elimination to be performed at once. This is outlined in Algorithm 5.9. Here *kb* is used to index the frontal matrix on the *kb*-th step (1 ≤ *kb* ≤ *nsup*).

As an example, consider the matrix and its assembly tree given in Figure 4.10. The *nsup* = 5 supernodes are {1, 2}, 3, 4, 5, {6, 7, 8, 9} and so variables 1 and 2 can be eliminated together on the first step. Assembling rows/columns 1 and 2 of the original matrix, the frontal matrix *F*<sub>1</sub> and generated element *V*<sub>1</sub> have the structure

$$F\_1 = \begin{array}{c|cccc} & 1 & 2 & 8 & 9 \\ \hline 1 & \* & \* & \* & \* \\ 2 & \* & \* & \* & \* \\ 8 & \* & \* & \* & \\ 9 & \* & \* & & \* \end{array}, \qquad V\_1 = \begin{array}{c|cc} & 8 & 9 \\ \hline 8 & \* & f \\ 9 & f & \* \end{array},$$

where *f* denotes fill-in entries (only the lower triangular entries are stored in practice). Similarly,

$$F\_2 = \begin{array}{c|ccc} & 3 & 4 & 8 \\ \hline 3 & \* & \* & \* \\ 4 & \* & \* & \* \\ 8 & \* & \* & \* \end{array}, \qquad V\_2 = \begin{array}{c|cc} & 4 & 8 \\ \hline 4 & \* & \* \\ 8 & \* & \* \end{array}.$$

The frontal matrix *F*<sub>3</sub> and generated element *V*<sub>3</sub> are given by

$$F\_3 = \begin{array}{c|ccc} & 4 & 7 & 8 \\ \hline 4 & \* & \* & \* \\ 7 & \* & \* & \\ 8 & \* & & \* \end{array} \oplus V\_2, \qquad V\_3 = \begin{array}{c|cc} & 7 & 8 \\ \hline 7 & \* & f \\ 8 & f & \* \end{array}.$$

Then

$$F\_4 = \begin{array}{c|ccc} & 5 & 7 & 8 \\ \hline 5 & \* & \* & \* \\ 7 & \* & \* & \\ 8 & \* & & \* \end{array}, \qquad V\_4 = \begin{array}{c|cc} & 7 & 8 \\ \hline 7 & \* & f \\ 8 & f & \* \end{array},$$

and, finally, with *kb* = 5 we have

$$F\_5 = \begin{array}{c|cccc} & 6 & 7 & 8 & 9 \\ \hline 6 & \* & \* & \* & \* \\ 7 & \* & \* & & \* \\ 8 & \* & & \* & \\ 9 & \* & \* & & \* \end{array} \oplus V\_4 \oplus V\_3 \oplus V\_1.$$

An important implementation detail is how and where to store the generated elements. The partial factorization of *Fkb* at supernode *kb* can be performed once the partial factorizations at all the vertices belonging to the subtree of the assembly tree with root vertex *kb* are complete. If the vertices of the assembly tree are ordered using a depth-first search, the generated elements required at each stage are the most recently computed ones amongst those that have not yet been assembled. This makes it convenient to use a stack. This affects the order in which the variables are eliminated but in exact arithmetic, the results are identical.

Nevertheless, the memory demands of the multifrontal method can be very large. They depend not only on the initial ordering of *A*: the ordering of the children of a vertex in the assembly tree can also significantly affect the required stack size. Some implementations therefore aim to limit the stack storage requirements. An attractive feature of the multifrontal method is that the generated elements can be held using auxiliary storage (in files on disk) to restrict the in-core memory requirements, allowing larger problems to be solved than would otherwise be possible.
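The stack discipline and the effect of the child ordering can be illustrated with a small simulation. The sketch below is a simplified model under stated assumptions: the tree, the element sizes and the function names `postorder` and `peak_stack` are hypothetical, and only the storage of the generated elements is counted (the frontal matrices themselves are ignored).

```python
def postorder(children, root):
    """Depth-first postorder of a tree given as {vertex: [children]}."""
    order, todo = [], [(root, False)]
    while todo:
        v, expanded = todo.pop()
        if expanded:
            order.append(v)
        else:
            todo.append((v, True))
            for c in reversed(children.get(v, [])):
                todo.append((c, False))
    return order


def peak_stack(children, root, elem_size):
    """Peak number of stacked entries when vertices are processed in postorder."""
    stack, current, peak = [], 0, 0
    for v in postorder(children, root):
        # The elements of the children are exactly the ones on top of the stack.
        for c in reversed(children.get(v, [])):
            assert stack.pop() == c
            current -= elem_size[c]
        if v != root:                  # the root generates no element
            stack.append(v)
            current += elem_size[v]
            peak = max(peak, current)
    return peak


size = {1: 9, 2: 16, 3: 1}             # hypothetical generated element sizes
print(peak_stack({4: [1, 3], 3: [2]}, 4, size))  # 25
print(peak_stack({4: [3, 1], 3: [2]}, 4, size))  # 16
```

Processing the child with the larger subtree requirement first reduces the peak in this example from 25 to 16 entries, although the same elements are generated in both cases.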

#### **5.5 Parallelism Within Sparse Cholesky Factorizations**

Sparse Cholesky factorizations use supernodes and task graphs (the assembly tree for the multifrontal method) to control the computation. The number of rows and columns in a supernode typically increases away from the leaf vertices and towards the root of the task graph because a supernode accumulates fill-in from its descendants in the task graph. As a result, tasks that are relatively close to the root tend to have more work associated with them. On the other hand, the width of the task graph shrinks close to the root. In other words, a typical task graph for sparse matrix factorization tends to have a large number of small independent tasks close to the leaf vertices, but a small number of large tasks close to the root. An ideal parallelization strategy that would match the characteristics of the problem is as follows. Initially, assign the relatively plentiful independent tasks at or near the leaf vertices to parallel threads or processes. This is called **task** or **tree level** parallelism; it is influenced by the ordering of *A*. As tasks complete, other tasks become available and are scheduled similarly. This continues while there are enough independent tasks to keep all the threads or processes busy. When the number of available parallel tasks becomes too small, the only way to keep all the threads or processes busy is to assign more than one of them to a task. This is termed **node level** parallelism. The number of threads or processes working on individual tasks should increase as the number of parallel tasks decreases. Eventually, all threads or processes are available to work on the root task. The computation corresponding to the root task is equivalent to factoring a dense matrix of the size of the root supernode.

The multifrontal method is often the formulation of choice for highly parallel implementations of sparse matrix factorizations. This is because of its natural data locality (most of the work of the factorization is performed in the dense frontal matrices) and the ease of synchronization that it permits. In general, each supernode is updated by multiple other supernodes and it can potentially update many other supernodes during the course of the factorization. If implemented naively, all these updates may require excessive locking and synchronization in a shared-memory environment or generate excessive message-traffic in a distributed environment. In the multifrontal method, the updates are accumulated and channelled along the paths from the leaf vertices of the assembly tree to its root vertex. This gives a manageable structure to the potentially haphazard interaction among the tasks.

In Section 1.2.4, bit compatibility was discussed. While different orderings of the children of a vertex in the assembly tree do not affect the total number of floating-point operations that are performed in the multifrontal method, in finite-precision arithmetic changing the order of the assemblies into the frontal matrices can lead to slightly different results. Given that the number of children is typically small and that large matrices can be partitioned such that summations can be safely performed in parallel, the overhead in the multifrontal method of enforcing a defined order of the summation is relatively small. By contrast, in the supernodal approach, for each data block a number of matrices equal to the number of block dependencies are summed. Given that this number is relatively large (several thousand) for many nodes, enforcing an order may be detrimental to efficiency.
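The sensitivity to the order of summation is easy to reproduce. In the following sketch the three values stand in for contributions from three generated elements to a single frontal matrix entry; the values themselves are arbitrary and chosen only for illustration.

```python
# Hypothetical contributions of three children to one frontal matrix entry.
contributions = [0.1, 0.2, 0.3]

first_order = (contributions[0] + contributions[1]) + contributions[2]
second_order = contributions[0] + (contributions[1] + contributions[2])

print(first_order == second_order)       # False
print(abs(first_order - second_order))   # of the order of the unit roundoff
```

Both sums use the same two additions, so the operation count is unchanged; only the rounding differs. Fixing the assembly order removes this source of run-to-run variation.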

#### **5.6 Notes and References**

Exploiting panels and blocks in both left- and right-looking Cholesky factorization algorithms is extremely important. The development of sparse supernodal factorizations for uniprocessors and multiprocessors in the 1990s is discussed by Ng & Peyton (1993a,b); Rothberg & Gupta (1993) presents an early comparison of various types of block Cholesky factorizations. PaStiX of Hénon et al. (2002) is a parallel left-looking supernodal solver that is primarily designed for positive definite systems. Rotkin & Toledo (2004) introduce a hybrid left-looking/right-looking algorithm and Rozin & Toledo (2005) show that no sparse numerical factorization is uniformly better than the others. An up-looking approach, which is fast in practice for very sparse matrices, is employed in the widely used CHOLMOD solver of Chen et al. (2008). The package HSL\_MA87 implements a sparse DAG-based Cholesky factorization for shared-memory architectures; further details of the approach can be found in Hogg et al. (2010).

The multifrontal algorithm has its origins in the simpler frontal method of Irons (1970), which was developed by the civil engineering community from the 1960s onwards to solve the linear systems that arise within finite element methods. At a time when the main memory of even the most powerful computers was extremely limited, the frontal method was heavily influenced by the need to minimize the memory requirements of the linear solver. It was initially designed for SPD banded linear systems and was subsequently extended to nonsymmetric problems by Hood (1976) and to the symmetric indefinite case by Reid (1981); Duff (1984) generalizes the approach to non-element problems. The frontal method proceeds by alternating the assembly of the finite elements into a single dense frontal matrix with the elimination and update of variables. Once variables have been eliminated they are no longer needed during the factorization and so they are removed from the frontal matrix and stored elsewhere (for example, not in main memory but on an external disk) until needed during the solve phase. This frees up space to accommodate the next element to be assembled. Because the frontal method does not use the assembly tree, the frontal matrix can be much larger than those in the multifrontal method, leading to higher operation counts but also allowing the use of BLAS with larger block sizes. Efficient implementations were developed up until the late 1990s, for example, by Duff & Scott (1996, 1999), who provide a package MA62 for SPD problems in element form that employs a single array of length *n*, exploits Level 3 BLAS, and holds the computed factors on disk; a coarse-grained parallel version is also available, see Duff & Scott (1994) and Scott (2001).

The frontal method and the work of Speelpenning (1978) on the so-called generalized element method led to the development by Duff & Reid (1983) of the multifrontal method for solving general symmetric systems (including systems in element form). A detailed matrix-based explanation is given in Liu (1992). The method is implemented in some of the most important sparse direct solvers. The MUMPS (2022) package, which has been actively developed over many years, provides a state-of-the-art distributed memory general-purpose multifrontal solver that uses shared-memory parallelism within each MPI process. Other important parallel multifrontal solvers are HSL\_MA97 (Hogg & Scott, 2013b) and WSMP (2020), while the serial package MA57 of Duff (2004) (which superseded the original and perhaps best-known multifrontal solver MA27 of Duff & Reid (1983)) remains very popular. An attractive feature of HSL\_MA97 is that it computes bit-compatible solutions. HSL\_MA77 of Reid & Scott (2009) is designed to minimize memory requirements by allowing the factors and the multifrontal stack to be efficiently held outside of main memory (an option that is also offered by MUMPS). In common with earlier frontal solvers, HSL\_MA77 allows the user to input the system matrix in element form (that is, *A* is not explicitly assembled for problems coming from finite element applications but is input one element at a time).

The use of GPUs is well-suited to a multifrontal or supernodal factorization because these approaches rely on regular block computations within dense submatrices. Implementing the multifrontal method (including for symmetric indefinite matrices) on GPU architectures is discussed in Hogg et al. (2016), while Lacoste et al. (2012) and Rennich et al. (2016) present GPU-accelerated supernodal factorizations. Discussion of the use of GPUs within direct solvers is included in the comprehensive survey of Davis et al. (2016).

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 6 Sparse LU Factorizations**

*The closer one looks, the more subtle and remarkable Gaussian elimination appears – Trefethen (1985)*

*Gaussian elimination is living mathematics. It has mutated successfully for the last two hundred years to meet changing social needs – Grcar (2011)*

This chapter considers the LU factorization of a general nonsymmetric nonsingular sparse matrix *A*. In practice, numerical pivoting for stability and/or ordering of *A* to limit fill-in in the factors is often needed and the computed factorization is then of a permuted matrix *PAQ*. Pivoting is discussed in Chapter 7 and ordering algorithms in Chapter 8.

## **6.1 Sparse LU Factorizations and Their Graph Models**

In Chapter 4, graphs were used to describe structural changes during a sparse Cholesky factorization. In particular, the elimination tree was shown to play a key role and, in the previous chapter, the use of DAGs was discussed. For general matrices, there are a number of ways that graphs can be employed.

## *6.1.1 Use of Elimination DAGs*

The first graph model uses the elimination DAGs associated with *L* and *U* that were defined in (2.1)–(2.2). The following observation, which is illustrated in Figure 6.1, generalizes Observation 4.1 to nonsymmetric matrices.

**Observation 6.1** *If i > j and u<sub>ji</sub>* ≠ 0*, then the* **column replication principle** *states*

$$
\mathcal{S}\{L\_{i:n,j}\} \subseteq \mathcal{S}\{L\_{i:n,i}\},
$$

*that is, the pattern of column j of L (rows i to n) is replicated in the pattern of column i of L. Analogously, if i > j and l<sub>ij</sub>* ≠ 0*, then the* **row replication principle** *states*

$$
\mathcal{S}\{U\_{j,i:n}\} \subseteq \mathcal{S}\{U\_{i,i:n}\},
$$

*that is, the pattern of row j of U (columns i to n) is replicated in the pattern of row i of U.*

**Figure 6.1** An illustration of the column and row replication principles of sparse LU factorizations. The matrix *A* is on the left. In the centre, we show in red the filled entries in *L* resulting from the replication of the first column in the second column because *u*<sub>12</sub> ≠ 0. On the right, we show in blue the filled entries in *U* resulting from the replication of the second row in the third row because *l*<sub>32</sub> ≠ 0. Other filled entries resulting from subsequent steps of the factorization are denoted in black.

Algorithm 6.1 outlines a basic sparse LU factorization. Here it is assumed that *A* is factorizable so that pivoting is not needed. The remainder of this chapter looks at techniques that can be used to develop the approach into an efficient one.

The following theorem formulates the recursive column replication and the replication of nonzeros along rows of *L* using directed paths in G(*U*); an analogous result holds for the rows of *U* and directed paths in G(*L<sup>T</sup>*).

**Theorem 6.1 (Gilbert & Liu 1993)** *Assume that for some k < j there is a directed path k* ⇒ *j in* G(*U*)*. Then*

$$\mathcal{S}\{L\_{j:n,k}\} \subseteq \mathcal{S}\{L\_{j:n,j}\}.\tag{6.1}$$

*Moreover, if l<sub>ik</sub>* ≠ 0 *for some i > j, then l<sub>is</sub>* ≠ 0 *for all vertices s on this path.*

The next two theorems generalize Theorem 4.3 to *A* being a general nonsymmetric matrix.

**Theorem 6.2 (Gilbert & Liu 1993)** *If a<sub>ij</sub>* = 0 *and i > j, then there is a filled entry l<sub>ij</sub>* ≠ 0 *if and only if there exists k < j such that a<sub>ik</sub>* ≠ 0 *and there is a directed path k* ⇒ *j in* G(*U*)*.*

#### **ALGORITHM 6.1 Basic sparse LU factorization**

**Input:** Nonsymmetric and factorizable matrix *A* = *L<sub>A</sub>* + *D<sub>A</sub>* + *U<sub>A</sub>*. **Output:** LU factorization *A* = *LU*.

1: *L* = *I* + *L<sub>A</sub>* ▷ Identity plus strictly lower triangular part of *A*
2: *U* = *D<sub>A</sub>* + *U<sub>A</sub>* ▷ Diagonal plus strictly upper triangular part of *A*
3: **for** *k* = 1 : *n* − 1 **do**
4:    **for** *i* ∈ {*i* > *k* | *l<sub>ik</sub>* ≠ 0} **do**
5:       *l<sub>ik</sub>* = *l<sub>ik</sub>* / *u<sub>kk</sub>*
6:       *U*<sub>*i*,*i*:*n*</sub> = *U*<sub>*i*,*i*:*n*</sub> − *U*<sub>*k*,*i*:*n*</sub> *l<sub>ik</sub>* ▷ Update row *i* of *U*
7:    **end for**
8:    **for** *j* ∈ {*j* > *k* | *u<sub>kj</sub>* ≠ 0} **do**
9:       *L*<sub>*j*+1:*n*,*j*</sub> = *L*<sub>*j*+1:*n*,*j*</sub> − *L*<sub>*j*+1:*n*,*k*</sub> *u<sub>kj</sub>* ▷ Update column *j* of *L*
10:    **end for**
11: **end for**
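A direct transcription of Algorithm 6.1 into Python may help to check the index ranges. The sketch below is a minimal dense version with 0-based indices; the function name `basic_lu` is hypothetical, sparsity is only mimicked by skipping zero entries, and, as in the algorithm, no pivoting is performed.

```python
def basic_lu(A):
    """Dense-storage sketch of Algorithm 6.1 (no pivoting, 0-based indices)."""
    n = len(A)
    # L = I + strictly lower triangular part of A; U = upper triangular part of A.
    L = [[1.0 if i == j else (A[i][j] if i > j else 0.0) for j in range(n)]
         for i in range(n)]
    U = [[A[i][j] if i <= j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n - 1):
        for i in range(k + 1, n):
            if L[i][k] != 0.0:
                L[i][k] /= U[k][k]
                for c in range(i, n):              # update row i of U
                    U[i][c] -= U[k][c] * L[i][k]
        for j in range(k + 1, n):
            if U[k][j] != 0.0:
                for r in range(j + 1, n):          # update column j of L
                    L[r][j] -= L[r][k] * U[k][j]
    return L, U


A = [[2.0, 1.0, 0.0],
     [4.0, 5.0, 3.0],
     [0.0, 6.0, 7.0]]
L, U = basic_lu(A)
print(L)  # [[1.0, 0.0, 0.0], [2.0, 1.0, 0.0], [0.0, 2.0, 1.0]]
print(U)  # [[2.0, 1.0, 0.0], [0.0, 3.0, 3.0], [0.0, 0.0, 1.0]]
```

The two inner loops split the usual rank-one Schur complement update into its upper triangular part (held in *U*) and its strictly lower triangular part (held in *L*).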

**Theorem 6.3 (Gilbert & Liu 1993)** *If a<sub>ij</sub>* = 0 *and i < j, then there is a filled entry u<sub>ij</sub>* ≠ 0 *if and only if there exists k < i such that a<sub>kj</sub>* ≠ 0 *and there is a directed path k* ⇒ *i in* G(*L<sup>T</sup>*)*.*

Theorems 6.2 and 6.3 are demonstrated in Figure 6.2. Consider the directed path 1 → 3 → 5 → 6 in G(*U*). Existence of this path implies the fill-in in *L*, first in column 3, then in columns 5 and 6. Similarly, the directed path 2 → 4 → 5 → 6 in G(*L<sup>T</sup>*) implies fill-in at positions (4, 7), (5, 7) and (6, 7) in *U*.

**Figure 6.2** The sparsity patterns of *A* (left) and *L* + *U* (right) together with the graphs G(*A*) (left), G(*L<sup>T</sup>*) (centre) and G(*U*) (right). The filled entries are denoted by *f* and the corresponding edges are the red dashed lines.

**Figure 6.3** Example to show the transitive reduction of a DAG. G is on the left, its transitive reduction G<sup>0</sup> is in the centre, and one possible G′ that is equireachable with G is on the right.

### *6.1.2 Transitive Reduction and Equireachability*

To employ G(*L<sup>T</sup>*) and G(*U*) in efficient algorithms, they need to be simplified. One possibility is to use transitive reductions that are sparser and preserve reachability within the graphs. A subgraph G<sup>0</sup> = (V, E<sup>0</sup>) is a **transitive reduction** of G = (V, E) if the following conditions hold:

(T1) there is a directed path from vertex *i* to vertex *j* in G if and only if there is a directed path from *i* to *j* in G<sup>0</sup>, and

(T2) there is no subgraph with vertex set V that satisfies (T1) and has fewer edges than G<sup>0</sup>.
A transitive reduction is unique for a DAG, as shown in the following theorem and illustrated in Figure 6.3.

**Theorem 6.4 (Aho et al. 1972)** *Let* G *be a DAG. The transitive reduction* G<sup>0</sup> *of* G *is unique and is the subgraph that has a directed path for every edge of* G *and has no proper subgraph with this property.*

If S{*A*} is symmetric, then, as illustrated in Figure 6.4, the role of the transitive reduction is played by the elimination tree.

**Theorem 6.5 (Liu 1990; Eisenstat & Liu 2005a)** *If A is symmetrically structured, then the transitive reduction of the DAG* G(*L<sup>T</sup>*) *(=* G(*U*)*) is the elimination tree* T(*A*)*.*
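For a small DAG the transitive reduction can be computed directly from its definition. The sketch below is a simple quadratic-cost illustration, not an efficient algorithm; the adjacency-dictionary representation and the function names `reachable`, `transitive_reduction` and `same_reachability` are assumptions made for this example.

```python
def reachable(adj, src):
    """Set of vertices reachable from src by a path of length at least one."""
    seen, todo = set(), [src]
    while todo:
        v = todo.pop()
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                todo.append(w)
    return seen


def transitive_reduction(adj):
    """Transitive reduction of a DAG given as {vertex: [successors]}."""
    reduced = {}
    for u, targets in adj.items():
        keep = set(targets)
        for v in targets:
            # An edge u -> w is redundant if w is reachable from another successor v.
            keep -= reachable(adj, v)
        reduced[u] = sorted(keep)
    return reduced


def same_reachability(g, h):
    """True if every vertex reaches the same set of vertices in g and in h."""
    return all(reachable(g, v) == reachable(h, v) for v in g)


# DAG of a dense 4 x 4 upper triangular pattern: edges i -> j for all i < j.
dense = {1: [2, 3, 4], 2: [3, 4], 3: [4], 4: []}
chain = transitive_reduction(dense)
print(chain)                            # {1: [2], 2: [3], 3: [4], 4: []}
print(same_reachability(dense, chain))  # True
```

For this symmetrically structured dense pattern the reduction is the chain 1 → 2 → 3 → 4, which is the elimination tree of a dense matrix.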

**Figure 6.4** The sparsity patterns of *L* + *U* of a symmetrically structured *A* together with the DAG G(*L<sup>T</sup>*) (left) and the elimination tree T(*A*) (right). The filled entries are denoted by *f* and the corresponding edges are the red dashed lines. It is straightforward to see that T(*A*) is obtained as the transitive reduction of G(*L<sup>T</sup>*).

Obtaining the exact transitive reduction of a DAG can be expensive. Instead, approximate reductions that drop the minimality condition may be computed. A directed graph G′ with the same vertex set as G that satisfies condition (T1) is said to be **equireachable** with G. The next result is a simplification of Theorem 6.1; an analogous result holds for the sparsity patterns of the rows of *U*.

**Theorem 6.6 (Gilbert & Liu 1993)** *Assume* G′ *is equireachable with* G(*U*) *and for some k < j there is a directed path k* ⇒ *j in* G′*. Then (6.1) holds. Moreover, if l<sub>ik</sub>* ≠ 0 *for some i > j, then l<sub>is</sub>* ≠ 0 *for all vertices s on the directed path.*

Equireachability enables sparse triangular linear systems to be solved more efficiently. In Chapter 5, Theorem 5.2 describes how to obtain the sparsity pattern J of the solution of a lower triangular system using paths in G(*L<sup>T</sup>*). This graph can be replaced by any graph that is equireachable with G(*L<sup>T</sup>*). Equireachability also allows Theorems 6.2 and 6.3 to be rewritten using paths in a graph G′ that is equireachable with G.

**Theorem 6.7 (Gilbert & Liu 1993)** *If a<sub>ij</sub>* = 0 *and i > j, then there is a filled entry l<sub>ij</sub>* ≠ 0 *if and only if there exists k < j such that a<sub>ik</sub>* ≠ 0 *and a directed path k* ⇒ *j in* G′(*U*)*, where* G′(*U*) *is equireachable with* G(*U*)*.*

**Theorem 6.8 (Gilbert & Liu 1993)** *If a<sub>ij</sub>* = 0 *and i < j, then there is a filled entry u<sub>ij</sub>* ≠ 0 *if and only if there exists k < i such that a<sub>kj</sub>* ≠ 0 *and a directed path k* ⇒ *i in* G′(*L<sup>T</sup>*)*, where* G′(*L<sup>T</sup>*) *is equireachable with* G(*L<sup>T</sup>*)*.*

Figure 6.5 depicts G(*U*) and G′(*U*) for the matrix in Figure 6.2.

**Figure 6.5** The DAG G(*U*) for the matrix from Figure 6.2 (left) and G′(*U*), which is equireachable with G(*U*) (right).

A description of the sparsity patterns of the columns of *L* can be obtained from the Schur complement (3.2) as follows:

$$\mathcal{S}\{L\_{j:n,j}\} = \mathcal{S}\{A\_{j:n,j}\} \cup \bigcup\_{k<j,\ u\_{kj} \neq 0} \mathcal{S}\{L\_{j:n,k}\}.$$

Theorem 6.7 implies that not all the terms in this union are needed to obtain S{*L*<sub>*j*:*n*,*j*</sub>}. This result is given in Theorem 6.9, which shows how S{*L*} can be computed by columns if a graph G′(*U*) that is equireachable with G(*U*) is known.

**Theorem 6.9 (Gilbert & Liu 1993)** *If* G′(*U*) *is equireachable with* G(*U*)*, then*

$$\mathcal{S}\{L\_{j:n,j}\} = \mathcal{S}\{A\_{j:n,j}\} \cup \bigcup\_{(k \to j) \in \mathcal{E}(\mathcal{G}'(U))} \mathcal{S}\{L\_{j:n,k}\}, \quad 1 \le j \le n. \tag{6.2}$$

*Proof* Consider an edge (*k* → *j*) in G(*U*) that is not in G′(*U*). Because G′(*U*) is equireachable with G(*U*), there is a directed path *k* ⇒ *j* in G′(*U*). Repeatedly applying (6.1) along this path, we see that S{*L*<sub>*j*:*n*,*k*</sub>} is contained in the right-hand side of (6.2) and therefore S{*L*<sub>*j*:*n*,*j*</sub>} is contained in the right-hand side of (6.2). Because the right-hand side of (6.2) is trivially contained in the left-hand side, the result follows.

An analogous result holds for the rows of *U*.

**Theorem 6.10 (Gilbert & Liu 1993)** *If* G′(*L<sup>T</sup>*) *is equireachable with* G(*L<sup>T</sup>*)*, then*

$$\mathcal{S}\{U\_{i,i:n}\} = \mathcal{S}\{A\_{i,i:n}\} \cup \bigcup\_{(k \to i) \in \mathcal{E}(\mathcal{G}'(L^T))} \mathcal{S}\{U\_{k,i:n}\}, \quad 1 \le i \le n.$$

As an example of Theorem 6.9, consider the matrix in Figure 6.2. Because (3 → 5) is the only edge of G′(*U*) in the union on the right-hand side of (6.2), S{*L*<sub>5:7,5</sub>} is given by

$$\mathcal{S}\{L\_{5:7,5}\} = \mathcal{S}\{A\_{5:7,5}\} \cup \mathcal{S}\{L\_{5:7,3}\}.$$

We can see this from the graph G′(*U*) in Figure 6.5 (top right).

#### *6.1.3 Symbolic LU Factorizations Using DAGs*

Factorization by bordering can be used to obtain S{*L*} by rows and S{*U*} by columns. Assume the sparsity patterns of the first *k* − 1 rows of *L* and the first *k* − 1 columns of *U* (1 *< k* ≤ *n*) have been computed. At step *k*, the factors satisfy

$$A\_{1:k,1:k} = \begin{pmatrix} A\_{1:k-1,1:k-1} & A\_{1:k-1,k} \\ A\_{k,1:k-1} & a\_{kk} \end{pmatrix} = \begin{pmatrix} L\_{1:k-1,1:k-1} & 0 \\ L\_{k,1:k-1} & 1 \end{pmatrix} \begin{pmatrix} U\_{1:k-1,1:k-1} & U\_{1:k-1,k} \\ 0 & u\_{kk} \end{pmatrix} . \tag{6.3}$$

Equating terms for the *(*2*,* 1*)* block, row *k* of *L* satisfies

$$L\_{k,1:k-1}U\_{1:k-1,1:k-1} = A\_{k,1:k-1},$$

or, equivalently, if *y* denotes the off-diagonal part of column *k* of *L<sup>T</sup>*, then it is the solution of the lower triangular system

$$U\_{1:k-1,1:k-1}^T y = A\_{k,1:k-1}^T.$$

From Theorem 5.2, the sparsity pattern of *y* is the set of all vertices reachable in the DAG G(*U*<sub>1:*k*−1,1:*k*−1</sub>) (or in a graph that is equireachable with it) from the nonzeros in *A*<sub>*k*,1:*k*−1</sub>. Similarly, equating terms in (6.3) for the (1, 2) block, column *k* of *U* satisfies

$$L\_{1:k-1,1:k-1}U\_{1:k-1,k} = A\_{1:k-1,k}.$$

Again, its sparsity pattern can be determined using Theorem 5.2 and the DAG G(*L<sup>T</sup>*<sub>1:*k*−1,1:*k*−1</sub>). The diagonal entry *u<sub>kk</sub>* is then computed as *a<sub>kk</sub>* − *L*<sub>*k*,1:*k*−1</sub>*U*<sub>1:*k*−1,*k*</sub>. This shows that determining the sparsity patterns of *L* and *U* and computing their numerical values are coupled: the computations of the two factors need to be interleaved because computing part of one requires information from a part of the other.
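The interleaving of symbolic and numerical work in the bordering scheme can be sketched as follows. This is a minimal dense-storage illustration under stated assumptions: indices are 0-based, the function names `reach` and `lu_bordering` are hypothetical, structural nonzeros are identified with numerically nonzero entries, and no pivoting is performed.

```python
def reach(starts, successors):
    """Vertices reachable from the start vertices (the starts included)."""
    seen, todo = set(starts), list(starts)
    while todo:
        v = todo.pop()
        for w in successors(v):
            if w not in seen:
                seen.add(w)
                todo.append(w)
    return seen


def lu_bordering(A):
    """LU by bordering; patterns of each new row and column come from reachability."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    U[0][0] = A[0][0]
    for k in range(1, n):
        # Row k of L: reachable set in G(U[:k,:k]) from the nonzeros of A[k,:k].
        rows = reach([j for j in range(k) if A[k][j] != 0.0],
                     lambda p: [j for j in range(p + 1, k) if U[p][j] != 0.0])
        for j in sorted(rows):                     # sparse forward substitution
            s = A[k][j] - sum(L[k][p] * U[p][j] for p in rows if p < j)
            L[k][j] = s / U[j][j]
        # Column k of U: reachable set in G(L[:k,:k]^T) from the nonzeros of A[:k,k].
        cols = reach([i for i in range(k) if A[i][k] != 0.0],
                     lambda p: [i for i in range(p + 1, k) if L[i][p] != 0.0])
        for i in sorted(cols):
            U[i][k] = A[i][k] - sum(L[i][p] * U[p][k] for p in cols if p < i)
        U[k][k] = A[k][k] - sum(L[k][p] * U[p][k] for p in range(k))
    return L, U


A = [[2.0, 1.0, 0.0],
     [4.0, 5.0, 3.0],
     [2.0, 0.0, 7.0]]
L, U = lu_bordering(A)
print(L[2])  # the (2, 1) entry is a filled entry: a_21 = 0 but l_21 is nonzero
```

In the example, row 2 of *A* has a nonzero only in column 0, but column 1 is reachable from 0 through the edge 0 → 1 of the graph of *U*, so a filled entry appears in position (2, 1) of *L*.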

#### *6.1.4 Graph Pruning*

Consider the matrices in Figure 6.6. The one in the centre is the same as the one on the left except that the entries in positions (4, 6) and (6, 4) have been removed (that is, pruned). Both matrices have the same sets of reachable vertices in G(*L<sup>T</sup>*) and G(*U*). This suggests how to find G′(*L<sup>T</sup>*) and G′(*U*) that are equireachable with G(*L<sup>T</sup>*) and G(*U*), respectively.

**Theorem 6.11 (Eisenstat & Liu 1992)** *If for some j < s both l<sub>sj</sub>* ≠ 0 *and u<sub>js</sub>* ≠ 0*, then there are no edges (j* → *k) with k > s in the transitive reductions of* G(*U*) *and* G(*L<sup>T</sup>*)*.*


**Figure 6.6** An example of symmetric pruning. On the left is S{*L*+*U*}. In the centre is the reduced sparsity pattern obtained by symmetric pruning. On the right is the reduced sparsity pattern that results from symmetric path pruning.

*Proof* Let (*j* → *k*) with *k* > *s* be an edge of G(*U*), that is, *u<sub>jk</sub>* ≠ 0. Because *l<sub>sj</sub>* ≠ 0 and *u<sub>jk</sub>* ≠ 0 imply that *u<sub>sk</sub>* ≠ 0, there is a path *j* → *s* → *k* in G(*U*) and the edge (*j* → *k*) does not belong to the transitive reduction of G(*U*). The result for G(*L<sup>T</sup>*) can be seen analogously.

This theorem implies that if for some *s >* 1 there are edges

$$j \xrightarrow{\mathcal{G}(L^{T})} s \quad \text{and} \quad j \xrightarrow{\mathcal{G}(U)} s,$$

then all edges (*j* → *k*) in G(*U*) and G(*L<sup>T</sup>*) with *k* > *s* can be pruned. The resulting DAGs G′(*U*) and G′(*L<sup>T</sup>*) have fewer edges and are equireachable with G(*U*) and G(*L<sup>T</sup>*), respectively. The removal of redundant edges based on Theorem 6.11 is called **symmetric pruning**.
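Symmetric pruning is straightforward to express once the patterns of *L* and *U* are available. The sketch below assumes 0-based indices and represents the patterns as sets of index pairs; the function names `symmetric_pruning` and `reachable` and this data layout are assumptions made for the illustration.

```python
def reachable(adj, src):
    """Set of vertices reachable from src by a path of length at least one."""
    seen, todo = set(), [src]
    while todo:
        v = todo.pop()
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                todo.append(w)
    return seen


def symmetric_pruning(L_pat, U_pat, n):
    """Return pruned successor lists for G(U) and G(L^T).

    L_pat: pairs (i, j), i > j, with l_ij nonzero;
    U_pat: pairs (j, k), j < k, with u_jk nonzero.
    """
    gU, gLT = {}, {}
    for j in range(n):
        u_succ = [k for k in range(j + 1, n) if (j, k) in U_pat]
        l_succ = [i for i in range(j + 1, n) if (i, j) in L_pat]
        # Smallest s with both l_sj and u_js nonzero (a symmetric pair).
        common = [s for s in u_succ if (s, j) in L_pat]
        if common:
            s = common[0]
            u_succ = [k for k in u_succ if k <= s]   # prune edges j -> k, k > s
            l_succ = [k for k in l_succ if k <= s]
        gU[j], gLT[j] = u_succ, l_succ
    return gU, gLT


# Patterns of the factors of a small nonsymmetric 4 x 4 example (hypothetical).
L_pat = {(1, 0), (2, 0), (2, 1), (3, 2)}
U_pat = {(0, 1), (0, 3), (1, 3), (2, 3)}
gU, gLT = symmetric_pruning(L_pat, U_pat, 4)
print(gU)   # {0: [1], 1: [3], 2: [3], 3: []}
print(gLT)  # {0: [1], 1: [2], 2: [3], 3: []}
```

Here the symmetric pair *l*<sub>10</sub>, *u*<sub>01</sub> allows the edges 0 → 3 in the graph of *U* and 0 → 2 in the graph of *L<sup>T</sup>* to be dropped; vertex 0 still reaches the same vertices through vertex 1.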

There are other ways to perform pruning. For example, if for some *s >* 1 there are paths

$$j \overset{\mathcal{G}(L^{T})}{\Longrightarrow} s \quad \text{and} \quad j \overset{\mathcal{G}(U)}{\Longrightarrow} s,$$

then for all *k* > *s* **symmetric path pruning** removes the edges (*j* → *k*) from G(*U*) and G(*L<sup>T</sup>*). Consider again Figure 6.6. In the centre is the sparsity pattern after symmetric pruning and on the right is the reduced sparsity pattern that results from symmetric path pruning. The edge (1 → 6) is not required in G′(*L<sup>T</sup>*) or G′(*U*) because there are paths

$$1 \xrightarrow{\mathcal{G}(L^{T})} 2 \xrightarrow{\mathcal{G}(L^{T})} 4 \xrightarrow{\mathcal{G}(L^{T})} 5 \xrightarrow{\mathcal{G}(L^{T})} 6 \quad \text{and} \quad 1 \xrightarrow{\mathcal{G}(U)} 5 \xrightarrow{\mathcal{G}(U)} 6.$$

**Figure 6.7** An example of the sparsity pattern of a nonsymmetric matrix *A* (left), S{*L* + *U*} with filled entries denoted by *f* (right) and its elimination tree.

#### *6.1.5 Elimination Trees for Nonsymmetric Matrices*

The elimination DAGs G*(L)* and G*(U )* can be combined into a single structure called the **nonsymmetric elimination tree** in which edges are replaced by paths. This can be advantageous because it is more compact. From (4.3), if S{*A*} is symmetric, then its elimination tree is defined in terms of the mapping

$$parent(j) = \min\{i \mid i > j \text{ and } l_{ij} \neq 0\}.$$

The condition *l<sub>ij</sub>* ≠ 0 is equivalent to the existence of the cycle $i \xrightarrow{\mathcal{G}(L)} j \xrightarrow{\mathcal{G}(L^{T})} i$. In the nonsymmetric case, the definition can be generalized using directed paths

$$parent(j) = \min\{i \mid i > j \text{ and } i \overset{\mathcal{G}(L)}{\Longrightarrow} j \overset{\mathcal{G}(U)}{\Longrightarrow} i\}.\tag{6.4}$$

This is illustrated in Figure 6.7. Vertices 6, 8, and 10 are the only ones with cycles of the form

$$i \overset{\mathcal{G}(L)}{\Longrightarrow} 2 \overset{\mathcal{G}(U)}{\Longrightarrow} i,$$

namely,

$$6 \xrightarrow{\mathcal{G}(L)} 2 \xrightarrow{\mathcal{G}(U)} 5 \xrightarrow{\mathcal{G}(U)} 6, \quad 8 \xrightarrow{\mathcal{G}(L)} 2 \xrightarrow{\mathcal{G}(U)} 8 \quad \text{and} \quad 10 \xrightarrow{\mathcal{G}(L)} 6 \xrightarrow{\mathcal{G}(L)} 2 \xrightarrow{\mathcal{G}(U)} 10.$$

In this example, *parent*(2) = 6.

#### **ALGORITHM 6.2 Basic computation of the elimination tree for nonsymmetric** *A*

**Input:** Digraph G(*A*).

**Output:** The elimination tree given by the mapping *parent*.

1: *parent*(1 : *n*) = 0

2: **for** *i* = 1 : *n* **do**

3: Find the vertex set V<sub>*C*</sub> of the strong component of G(*A*<sub>1:*i*,1:*i*</sub>) that contains *i*

4: **for** *j* ∈ V<sub>*C*</sub> \ {*i*} **do**

5: **if** *parent*(*j*) = 0 **then**

6: *parent*(*j*) = *i*

7: **end if**

8: **end for**

9: *parent*(*i*) = 0

10: **end for**

Theorem 6.12, which can be regarded as a generalization of Corollary 4.6, shows how the elimination tree for nonsymmetric *A* can be constructed.

**Theorem 6.12 (Eisenstat & Liu 2005a)** *Let A be a nonsymmetric matrix. i* = *parent (j ) if and only if i>j and i is the smallest vertex that belongs to the same strong component of* G*(A*1:*i,*1:*i) as vertex j .*

This result is employed in Algorithm 6.2. The complexity of finding the strong components of a digraph with *m* edges and *n* vertices is *O(n* + *m)*. Hence, the complexity of Algorithm 6.2 is *O(nz(A) n)*. More sophisticated approaches with complexity *O(nz(A)*log *n)* exist.

To illustrate Algorithm 6.2, consider the matrix and its elimination tree depicted in Figure 6.7. The main loop sets the first nonzero value in the array *parent* when *i* = 3 because this is the first *i* for which the set V*<sup>C</sup>* \ {*i*} is non empty; it is equal to {1} and thus *parent (*1*)* = *i* = 3. For *i* = 4, the vertex set {1*,* 3*,* 4} forms a strong component of G*(A*1:4*,*1:4*)* and so *parent (*3*)* = 4. For *i* = 5, the single vertex {5} is a strong component of G*(A*1:5*,*1:5*)* and, therefore, 5 is not a parent of any other vertex (it is a leaf vertex). G*(A*1:6*,*1:6*)* has two strong components with vertex sets {1*,* 3*,* 4} and {2*,* 5*,* 6}. *i* = 6 belongs to the second of these and thus the algorithm sets *parent (j )* = *i* = 6 for *j* = 2 and 5.
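The walk-through above can be mirrored by a short Python sketch of Algorithm 6.2. It is only an illustration under stated assumptions: the function names `nonsymmetric_etree` and `_reach` are hypothetical, the strong component containing *i* is found by intersecting a forward and a backward search restricted to vertices 1, …, *i* (a simple choice, not an optimized strong-components algorithm), and the 4 × 4 pattern is a made-up example rather than the matrix of Figure 6.7.

```python
def _reach(adj, start, limit):
    """Vertices <= limit reachable from start using only vertices <= limit."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w <= limit and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def nonsymmetric_etree(n, entries):
    """Return [parent(1), ..., parent(n)], where 0 means no parent.

    entries holds the (row, column) positions of the off-diagonal nonzeros
    of A with 1-based indices; a_rc != 0 gives the edge r -> c of G(A).
    """
    succ = {v: set() for v in range(1, n + 1)}
    pred = {v: set() for v in range(1, n + 1)}
    for r, c in entries:
        if r != c:
            succ[r].add(c)
            pred[c].add(r)

    parent = [0] * (n + 1)
    for i in range(1, n + 1):
        # Strong component of G(A_{1:i,1:i}) containing i: vertices that both
        # reach i and are reached from i inside the leading submatrix.
        component = _reach(succ, i, i) & _reach(pred, i, i)
        for j in component - {i}:
            if parent[j] == 0:
                parent[j] = i
        parent[i] = 0
    return parent[1:]


# Hypothetical 4 x 4 pattern: off-diagonal nonzeros a13, a31, a34, a43, a24.
entries = [(1, 3), (3, 1), (3, 4), (4, 3), (2, 4)]
parent = nonsymmetric_etree(4, entries)
print(parent)  # [3, 0, 4, 0]
```

Here *parent*(1) = 3 is set when *i* = 3 and *parent*(3) = 4 when *i* = 4; vertex 2 reaches vertex 4 but is not reached from it, so it stays a root.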

An attractive idea for constructing S{*L* + *U*} and subsequently computing the LU factorization is based on using the **column elimination tree** <sup>T</sup> *(AT A)*. 

**Theorem 6.13 (George & Ng 1985; Grigori et al. 2009)** *Assume all the diagonal entries of A are nonzero and let* $\widehat{L}\widehat{L}^T$ *be the Cholesky factorization of A<sup>T</sup>A. Then for any row permutation matrix P such that PA = LU the following holds:*

$$
\mathcal{S}\{L+U\} \subseteq \mathcal{S}\{\widehat{L} + \widehat{L}^T\}.
$$

**Figure 6.8** The sparsity patterns of *A* and *L* + *U* (top) and of *A<sup>T</sup>A* and $\widehat{L} + \widehat{L}^T$, where $A^TA = \widehat{L}\widehat{L}^T$ (bottom). Filled entries are denoted by *f*. The corresponding elimination trees are also given.

An important feature of Theorem 6.13 is that it holds for *any* row permutation matrix *P* applied to *A*. This allows partial pivoting (Section 3.1.2) to be used. The following result states that <sup>T</sup> *(AT A)* represents the potential dependencies among the columns in an LU factorization and that for strong Hall matrices no tighter prediction is possible from the sparsity structure of *A*.

**Theorem 6.14 (Gilbert & Ng 1993)** *If P A* = *LU is any factorization of A with partial pivoting, then the following hold.*


Figure 6.8 illustrates the differences in the sparsity patterns of *A* and *AT A* and of their factors; the corresponding elimination trees are also given. This reveals a potential problem with the column elimination tree: <sup>S</sup>{*AT <sup>A</sup>*} can have significantly more entries than S{*L* + *U*}. An extreme example is when *A* has one or more dense rows because *AT A* is then fully dense.
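The effect of a dense row can be checked with a small Python sketch. The helper `ata_pattern` and the 5 × 5 example are assumptions made for illustration only: *A* has a dense first row and is otherwise diagonal, so it is upper triangular and its LU factorization without pivoting has no fill, yet S{*A<sup>T</sup>A*} is completely full.

```python
def ata_pattern(rows):
    """Structural pattern of A^T A.

    rows[i] is the set of column indices of the nonzeros in row i of A.
    Entry (j, k) of A^T A is structurally nonzero when some row of A has
    nonzeros in both column j and column k.
    """
    pattern = set()
    for cols in rows:
        for j in cols:
            for k in cols:
                pattern.add((j, k))
    return pattern


n = 5
# Dense first row, diagonal entries elsewhere (0-based indices).
rows = [set(range(n))] + [{i} for i in range(1, n)]

nnz_a = sum(len(cols) for cols in rows)   # entries of A (and of L + U here)
nnz_ata = len(ata_pattern(rows))          # entries of A^T A
print(nnz_a, nnz_ata)  # 9 25
```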

### *6.1.6 Supernodes in LU Factorizations*

Supernodes group together columns of the factors with the same nonzero structure, allowing them to be treated as a dense submatrix for storage and computation. When solving SPD systems, supernodes can be determined during the symbolic phase. For nonsymmetric matrices, supernodes are harder to characterize. The need to incorporate pivoting means it may not be possible to predict the sparsity structures of the factors before the numerical factorization and they must be identified on-the-fly. While there are several possible ways to define supernodes, the simplest (which is widely used in practice) follows the symmetric case and defines a supernode to be a set of contiguously numbered columns of *L* with the triangular diagonal block treated as dense and the columns as having the same structure below the diagonal block.
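The simplest definition can be turned into a short detection routine. The sketch below is an assumption-laden illustration: `find_supernodes` and the column structures in `col_struct` are hypothetical, and the routine works on a given pattern of *L* rather than identifying supernodes on-the-fly during a factorization with pivoting. Column *j* joins the supernode of column *j* − 1 when *l*<sub>*j*,*j*−1</sub> ≠ 0 and the two columns have the same structure below row *j*.

```python
def find_supernodes(col_struct):
    """Group contiguous columns of L into supernodes.

    col_struct[j] is the set of row indices i > j with l_ij != 0 (0-based).
    Returns a list of (first_column, last_column) pairs.
    """
    n = len(col_struct)
    if n == 0:
        return []
    supernodes, start = [], 0
    for j in range(1, n):
        prev = col_struct[j - 1]
        # Dense diagonal block (l_{j,j-1} != 0) and identical structure below.
        same_block = j in prev and prev - {j} == col_struct[j]
        if not same_block:
            supernodes.append((start, j - 1))
            start = j
    supernodes.append((start, n - 1))
    return supernodes


# Hypothetical column structures of a 5 x 5 factor L.
col_struct = [{1, 2, 4}, {2, 4}, {4}, {4}, set()]
supernodes = find_supernodes(col_struct)
print(supernodes)  # [(0, 2), (3, 4)]
```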

In a Cholesky solver, fundamental supernodes (Section 4.6.1) are made contiguous by symmetrically permuting the matrix according to a postordering of its elimination tree; this does not change the sparsity of the Cholesky factor. For nonsymmetric *<sup>A</sup>*, before the numerical factorization, <sup>T</sup> *(AT A)* can be constructed and the columns of *A* then permuted according to its postordering to bring together supernodes. The following result extends Theorem 4.9. 

**Theorem 6.15 (Li 1996)** *Let A have column elimination tree* T*(A<sup>T</sup>A). Let p be a permutation vector such that if p<sub>i</sub> is an ancestor of p<sub>j</sub> in* T*(A<sup>T</sup>A), then i > j. Let P be the permutation matrix corresponding to p and let* $\bar{A} = PAP^T$*. Then* $\mathcal{T}(\bar{A}^T\bar{A})$ *is isomorphic to* T*(A<sup>T</sup>A); in particular, relabelling each vertex i of* $\mathcal{T}(\bar{A}^T\bar{A})$ *as p<sub>i</sub> yields* T*(A<sup>T</sup>A). If, in addition,* $\bar{A} = \bar{L}\bar{U}$ *is an LU factorization without pivoting, then* $P^T\bar{L}P$ *and* $P^T\bar{U}P$ *are lower triangular and upper triangular matrices, respectively, so that* $A = (P^T\bar{L}P)(P^T\bar{U}P)$ *is also an LU factorization.*

In practice, for many matrices the average size of a supernode is only 2 or 3 columns and many comprise a single column. Larger artificial supernodes may be created by merging vertex *<sup>j</sup>* with its parent vertex *<sup>i</sup>* in <sup>T</sup> *(AT A)* if the subtree rooted at *i* has fewer than some chosen number of vertices.

#### **6.2 LU Multifrontal Method**

The multifrontal method (Section 5.4) can be generalized to nonsymmetric *A* by modifying the definitions of the frontal matrices and generated elements to conform to an LU factorization. But natural generalizations to rectangular frontal and generated element matrices do not simultaneously satisfy the statements from Observation 5.1. These statements can be rewritten for the LU factorization as follows.

(a) Each generated element *Vj* is used only once to contribute to a frontal matrix.

(b) The row and column index lists for the rectangular frontal matrix *Fj* correspond to the nonzeros in column *Lj* :*n,j* and nonzeros in row *Uj,j* :*n*, respectively.

These conditions cannot both hold. An approach that satisfies (a) can be based on the sparsity pattern of <sup>S</sup>{*<sup>A</sup>* <sup>+</sup> *AT* } and storing some explicit zeros if <sup>S</sup>{*A*} is not symmetric. It is then analogous to the symmetric multifrontal method. In this case, although the frontal and generated elements may be numerically nonsymmetric, they are square and structurally symmetric. This approach performs well if S{*A*} is close to symmetric, that is, the symmetry index of *A* is close to unity.

An approach that satisfies (b) and not necessarily (a) splits the generated elements into smaller ones that are embedded into further rectangular frontal matrices. We illustrate this using the example from Figure 6.7, that is, the 10 × 10 sparsity pattern S{*L* + *U*} shown there (with rows and columns labelled 1 to 10),

where ∗ are entries in *A* and filled entries in *L* + *U* are denoted by *f*. Taking the entries in the first row and column, the sparsity patterns of the first frontal matrix and the corresponding generated element are

$$F_1 = \begin{array}{c|cc} & 1 & 3 \\ \hline 1 & \ast & \ast \\ 2 & \ast & f \\ 3 & \ast & \ast \\ 8 & \ast & f \end{array}, \qquad V_1 = \begin{array}{c|c} & 3 \\ \hline 2 & f \\ 3 & \ast \\ 8 & f \end{array}.$$

To construct *F*<sub>2</sub> that satisfies (b) we can only use part of *V*<sub>1</sub>. From the row and column replication principles, because *a*<sub>13</sub> ≠ 0, the sparsity pattern of column 1 is replicated in that of column 3 of the factors. While the entry in position (2, 3) belongs to *F*<sub>2</sub>, because of the row replication of the sparsity pattern of the first row in that of the second row, the remaining entries contribute to *F*<sub>3</sub> and so we split *V*<sub>1</sub> into two as follows

$$V_1^2 = \begin{array}{c|c} & 3 \\ \hline 2 & f \end{array}, \qquad V_1^3 = \begin{array}{c|c} & 3 \\ \hline 3 & \ast \\ 8 & f \end{array}, \qquad V_1 = V_1^2 \longleftrightarrow V_1^3,$$

where ←→ is the extend-add operator and *V*<sub>1</sub><sup>2</sup> and *V*<sub>1</sub><sup>3</sup> contribute to *F*<sub>2</sub> and *F*<sub>3</sub>, respectively. Then *F*<sub>2</sub> and the corresponding generated element *V*<sub>2</sub> are

$$F_2 = \begin{array}{c|cccc} & 2 & 5 & 8 & 10 \\ \hline 2 & \ast & \ast & \ast & \ast \\ 6 & \ast & & & \\ 8 & \ast & & & \end{array} \;\longleftrightarrow\; V_1^2 = \begin{array}{c|ccccc} & 2 & 3 & 5 & 8 & 10 \\ \hline 2 & \ast & f & \ast & \ast & \ast \\ 6 & \ast & f & f & f & f \\ 8 & \ast & f & f & \ast & f \end{array}, \qquad V_2 = \begin{array}{c|cccc} & 3 & 5 & 8 & 10 \\ \hline 6 & f & f & f & f \\ 8 & f & f & \ast & f \end{array}.$$

Consider the following splitting of *V*<sub>2</sub>

$$V_2 = \begin{array}{c|c} & 3 \\ \hline 6 & f \\ 8 & f \end{array} \;\longleftrightarrow\; \begin{array}{c|c} & 5 \\ \hline 6 & f \\ 8 & f \end{array} \;\longleftrightarrow\; \begin{array}{c|cc} & 8 & 10 \\ \hline 6 & f & f \\ 8 & \ast & f \end{array} \;\equiv\; V_2^3 \longleftrightarrow V_2^5 \longleftrightarrow V_2^6.$$

The next frontal matrix is

$$F_3 = \begin{array}{c|cc} & 3 & 4 \\ \hline 3 & \ast & \ast \\ 4 & \ast & \ast \end{array} \;\longleftrightarrow\; V_1^3 \;\longleftrightarrow\; V_2^3 = \begin{array}{c|cc} & 3 & 4 \\ \hline 3 & \ast & \ast \\ 4 & \ast & \ast \\ 6 & f & f \\ 8 & f & f \end{array}, \qquad V_3 = \begin{array}{c|c} & 4 \\ \hline 4 & \ast \\ 6 & f \\ 8 & f \end{array}.$$

The subsequent steps can be described in a similar way.

Theorem 6.16 expresses the nested relationship between the nonsymmetric multifrontal method and the nonsymmetric elimination tree.

**Theorem 6.16 (Eisenstat & Liu 2005b)** *Assume A is a general nonsymmetric matrix and t* = *parent (k) in* T *(A). Then*

$$
\mathcal{S}\{L_{t:n,k}\} \subseteq \mathcal{S}\{L_{t:n,t}\} \quad \text{and} \quad \mathcal{S}\{U_{k,t:n}\} \subseteq \mathcal{S}\{U_{t,t:n}\}.
$$

*Proof* Because *t* is the parent of *k*, by definition $t \overset{\mathcal{G}(L)}{\Longrightarrow} k \overset{\mathcal{G}(U)}{\Longrightarrow} t$. If *u<sub>ij</sub>* ≠ 0, then a multiple of column *i* is added to column *j* during the LU factorization. Thus, by a simple induction argument, for each *j* on the path $k \overset{\mathcal{G}(U)}{\Longrightarrow} t$, we must have S{*L*<sub>*j*:*n*,*k*</sub>} ⊆ S{*L*<sub>*j*:*n*,*j*</sub>}. In particular, this holds for column *t*. The second part follows by a similar argument using the path $t \overset{\mathcal{G}(L)}{\Longrightarrow} k$.

This result shows that the parent relationship in the nonsymmetric elimination tree guarantees that both row and column replications can be applied at the same time. Hence all entries of the submatrices of the generated element *Vk* with indices greater than or equal to *parent (k)* can be added to *Vparent (k)* using the operation ←→ . To illustrate this, consider again the 10 × 10 example above for which *parent (*1*)* = 3. Theorem 6.16 guarantees that *V*<sup>1</sup> can be embedded into *F*<sup>3</sup> because S{*L*3:*n,*1} ⊆ S{*L*3:*n,*3} and S{*U*1*,*3:*n*} ⊆ S{*U*3*,*3:*n*}.


#### **6.3 Preprocessing Sparse Matrices**

We now turn our attention to preprocessing techniques that can help in computing an LU factorization. In particular, we consider when *A* does not have a full transversal (that is, it has one or more zeros on the diagonal). For numerical stability and to reduce the number of permutations required during the factorization, it can be useful to permute *A* before the factorization begins to put large nonzero entries on the diagonal. As an example, consider the matrix *A* in Figure 6.9. It has *a*<sup>22</sup> = 0 and we want to know whether it can be permuted so that all the diagonal entries are nonzero. This question and its answer can be formulated in terms of matchings and bipartite graphs.

#### *6.3.1 Bipartite Graphs and Matchings*

Given a graph G = *(*V*,* E*)*, an edge subset M ⊆ E is called a **matching** (or assignment) if no two edges in M are incident to the same vertex. In matrix terms, a matching corresponds to a set of nonzero entries with no two belonging to the same row or column. A vertex is matched if there is an edge in the matching incident on the vertex, and is unmatched (or free) otherwise. The **cardinality** of a matching is the number of edges in it. A **maximum cardinality matching** (or **maximum matching**) is a matching of maximum cardinality. A matching is **perfect** if all the vertices are matched.

A **bipartite graph** is an undirected graph whose vertices can be partitioned into two disjoint sets such that no two vertices within the same set are adjacent, that is, each set is an **independent set**. Let the *n* × *n* matrix *A* have entries {*a<sub>ij</sub>*}. Associated with *A* is a bipartite graph defined as a triple G<sub>*b*</sub> = (V<sub>*row*</sub>, V<sub>*col*</sub>, E), where the row vertex set V<sub>*row*</sub> = {*i* | *a<sub>ij</sub>* ≠ 0} and the column vertex set V<sub>*col*</sub> = {*j*′ | *a<sub>ij</sub>* ≠ 0} correspond to the rows and columns of *A* and there is an (undirected) edge (*i*, *j*′) ∈ E if and only if *a<sub>ij</sub>* ≠ 0. This is illustrated in Figure 6.9. We use prime to distinguish between the independent set of row vertices and the independent set of column vertices, that is, *i* denotes a row vertex and *i*′ denotes a column vertex.

If *A* is structurally nonsingular, a matching M in G*<sup>b</sup>* is perfect if it has cardinality *n*. A perfect matching defines an *n* × *n* permutation matrix *Q* with entries *qij* given by 

$$q_{ij} = \begin{cases} 1, & \text{if } (j, i') \in \mathcal{M}, \\ 0, & \text{otherwise}. \end{cases}$$

Both *QA* and *AQ* have the matching entries on the (zero-free) diagonal. *Q* and the column permuted matrix *AQ* for the example in Figure 6.9 are given in Figure 6.10.
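The construction of *Q* from a perfect matching can be checked numerically. The following sketch is illustrative only: the helper names, the 0-based indexing and the 3 × 3 pattern are assumptions and do not correspond to the matrix of Figure 6.9. The list `match_col_of_row` stores, for each row *j*, the column *i*′ it is matched to, so that `Q[i][j] = 1` exactly when (*j*, *i*′) ∈ M.

```python
def permutation_from_matching(match_col_of_row):
    """Build Q with q_ij = 1 if row j is matched to column i (0-based)."""
    n = len(match_col_of_row)
    Q = [[0] * n for _ in range(n)]
    for j, i in enumerate(match_col_of_row):
        Q[i][j] = 1
    return Q


def matmul(X, Y):
    """Dense product of two square matrices stored as lists of lists."""
    n = len(X)
    return [[sum(X[r][k] * Y[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


# Hypothetical 3 x 3 pattern with zeros on the diagonal.
A = [[0, 1, 1],
     [1, 0, 0],
     [0, 1, 0]]
match_col_of_row = [2, 0, 1]   # row 0 -> column 2, row 1 -> column 0, row 2 -> column 1

Q = permutation_from_matching(match_col_of_row)
AQ = matmul(A, Q)
QA = matmul(Q, A)
print(AQ)  # [[1, 0, 1], [0, 1, 0], [0, 0, 1]]
print(QA)  # [[1, 0, 0], [0, 1, 0], [0, 1, 1]]
```

Both products have a zero-free diagonal, although they are different matrices: *AQ* permutes the columns of *A* and *QA* permutes its rows.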

**Figure 6.9** A sparse matrix and its bipartite graph G*<sup>b</sup>* (left). The matched matrix entries are in blue and edges that belong to a perfect matching in G*<sup>b</sup>* are given by the blue dashed lines (right). Note that the perfect matching is not unique (an alternative is in Figure 6.11).

$$Q = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}$$

*[Sparsity pattern of AQ: its columns are columns 3′, 1′, 4′, 2′, 5′, 6′ of A, in that order, and the matched entries lie on the diagonal.]*

**Figure 6.10** The permutation matrix *Q* and the column permuted matrix *AQ* corresponding to the matrix in Figure 6.9. The matched entries are on the diagonal of *AQ*.

#### *6.3.2 Augmenting Paths*

If a perfect matching exists, it can be found using augmenting paths. A path P in a graph is an ordered set of edges in which successive edges are incident to the same vertex. P is called an M**-alternating path** if the edges of P are alternately in M and not in M. An M-alternating path is an M**-augmenting path** in G*<sup>b</sup>* if it connects an unmatched column vertex with an unmatched row vertex. Note that the length of an M-augmenting path is an odd integer.

#### **ALGORITHM 6.3 Maximum matching algorithm**

**Input:** An undirected graph.

**Output:** A maximum matching M.

1: Find an initial matching M ▷ For example, M = ∅

2: **while** there exists an M-augmenting path P **do**

3: M = M ⊕ P ▷ Augment along P

4: **end while**


**Figure 6.11** An illustration of the search for a perfect matching using augmenting paths. On the left, the dashed lines represent a matching with cardinality 5. In the centre, the blue line is an augmenting path with end vertices 2 and 2′. On the right is the perfect matching with cardinality 6 that is obtained using the augmenting path.

Let M and P be subsets of E and define the symmetric difference

$$
\mathcal{M} \oplus \mathcal{P} := (\mathcal{M} \backslash \mathcal{P}) \cup (\mathcal{P} \backslash \mathcal{M}),
$$

that is, the set of edges that belongs to either M or P but not to both. If M is a matching and P is an M-augmenting path, then M ⊕ P is a matching with cardinality |M|+1. Growing the matching in this way is called augmenting along P. The next result shows that augmenting paths can be used to find a maximum matching (Algorithm 6.3).

**Theorem 6.17 (Berge 1957)** *A matching* M *in an undirected graph is a maximum matching if and only if there is no* M*-augmenting path.*

Figure 6.11 demonstrates the procedure. On the left is a bipartite graph with a matching with cardinality 5. In the centre, an augmenting path is shown that joins the unmatched row vertex 2 and the unmatched column vertex 2′ and passes through the matched edges (3, 3′) and (4, 4′). Augmenting the matching along this path, the cardinality of the matching increases to 6 and M ⊕ P is a perfect matching.
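For bipartite graphs, the augmenting path strategy of Algorithm 6.3 can be sketched as a depth-first search from each column vertex in turn. This is a minimal illustration, not the implementation used by any particular package: the function `maximum_matching`, the 0-based indexing and the 4 × 4 pattern are assumptions introduced here. Replacing `match_row[i]` along the successful search path is exactly the symmetric difference M ⊕ P.

```python
def maximum_matching(col_rows, n_rows):
    """Maximum matching of a bipartite graph by augmenting paths.

    col_rows[j] lists the rows i with a_ij != 0 (0-based).
    Returns (match_row, size), where match_row[i] is the column matched to
    row i, or -1 if row i is unmatched.
    """
    match_row = [-1] * n_rows

    def try_augment(j, visited):
        # Depth-first search for an augmenting path starting at column j.
        for i in col_rows[j]:
            if i in visited:
                continue
            visited.add(i)
            # Row i is free, or the column matched to it can be rematched.
            if match_row[i] == -1 or try_augment(match_row[i], visited):
                match_row[i] = j   # flip the edges along the path
                return True
        return False

    size = 0
    for j in range(len(col_rows)):
        if try_augment(j, set()):
            size += 1
    return match_row, size


# Hypothetical 4 x 4 pattern given column by column.
col_rows = [[0, 1], [0, 2], [2, 3], [2]]
match_row, size = maximum_matching(col_rows, 4)
print(match_row, size)  # [1, 0, 3, 2] 4
```

In this example the searches from columns 1 and 3 each find an augmenting path of length 3 that rematches an earlier column, and the final matching is perfect.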

### *6.3.3 Weighted Matchings*

While the maximum matching algorithm finds a permutation of *A* such that the permuted matrix has nonzero diagonal entries, there are more sophisticated variations that aim to ensure the absolute values of the diagonal entries of the permuted matrix (or their product) are in some sense large. This can increase the likelihood that the permuted matrix is strongly regular and reduce the need for partial pivoting during the LU factorization. The core problem is as follows: given an *n* × *n* matrix *A*, find a matching of the rows to the columns such that the product of the matched entries is maximized. That is, find a permutation vector *q* that maximizes 

$$\prod_{i=1}^{n} |a_{iq_i}|.\tag{6.5}$$

Define a matrix *C* corresponding to *A* with entries *cij* ≥ 0 as follows:

$$c_{ij'} = \begin{cases} \log(\max_i |a_{ij'}|) - \log |a_{ij'}|, & \text{if } a_{ij'} \neq 0, \\ \infty, & \text{otherwise.} \end{cases} \tag{6.6}$$

It is straightforward to see that finding a *q* that solves (6.5) is equivalent to finding a *q* that minimizes 

$$\sum_{i=1}^{n} |c_{iq_i}|,\tag{6.7}$$

which is equivalent to finding a minimum weight perfect matching in an edge weighted bipartite graph. This is a well-studied problem and is known as the bipartite weighted matching or linear sum assignment problem.
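The equivalence follows because, for any permutation *q*, the sum of the costs equals the sum over the columns of the logarithms of the column maxima (a constant that does not depend on *q*) minus the logarithm of the product in (6.5). The following brute-force Python check illustrates this on a small example; the function `cost_matrix` and the 3 × 3 matrix are assumptions made for the illustration, and enumerating all permutations is of course only feasible for tiny *n*.

```python
import math
from itertools import permutations


def cost_matrix(A):
    """Costs c_ij = log(max_i |a_ij|) - log|a_ij|, or infinity if a_ij = 0.

    Assumes every column of A contains at least one nonzero entry.
    """
    n = len(A)
    col_max = [max(abs(A[i][j]) for i in range(n)) for j in range(n)]
    return [[math.log(col_max[j]) - math.log(abs(A[i][j]))
             if A[i][j] != 0 else math.inf
             for j in range(n)]
            for i in range(n)]


# Hypothetical matrix; q[i] is the column matched to row i (0-based).
A = [[1.0, 4.0, 0.0],
     [3.0, 2.0, 0.0],
     [0.0, 5.0, 6.0]]
C = cost_matrix(A)
n = len(A)

# Permutation maximizing the product of the matched entries.
best_product = max(permutations(range(n)),
                   key=lambda q: math.prod(abs(A[i][q[i]]) for i in range(n)))
# Permutation minimizing the sum of the costs.
best_sum = min(permutations(range(n)),
               key=lambda q: sum(C[i][q[i]] for i in range(n)))

print(best_product, best_sum)  # (1, 0, 2) (1, 0, 2)
```

The maximizing permutation matches row 0 to column 1 and row 1 to column 0 (product 4 · 3 · 6 = 72), and the same permutation minimizes the cost sum.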

If G<sub>*b*</sub> = (V<sub>*row*</sub>, V<sub>*col*</sub>, E) is the bipartite graph associated with *A* then let G<sub>*b*</sub>(*C*) = (V<sub>*row*</sub>, V<sub>*col*</sub>, E) be the corresponding weighted bipartite graph in which each edge (*i*, *j*′) ∈ E has a weight *c<sub>ij′</sub>* ≥ 0. The weight (or cost) of a matching M in G<sub>*b*</sub>(*C*), denoted by *csum*(M), is the sum of its edge weights, that is,

$$csum(\mathcal{M}) = \sum_{(i,j') \in \mathcal{M}} c_{ij'}.$$

A perfect matching M in G<sub>*b*</sub>(*C*) is said to be a **minimum weight perfect matching** if it has smallest possible weight, i.e. *csum*(M) ≤ *csum*($\widehat{\mathcal{M}}$) for all possible perfect matchings $\widehat{\mathcal{M}}$.

The key concept for finding a minimum weight perfect matching is that of a **shortest augmenting path**. An M-augmenting path P starting at an unmatched column vertex is called **shortest** if

$$
\operatorname{csum}(\mathcal{M} \oplus \mathcal{P}) \le \operatorname{csum}(\mathcal{M} \oplus \widehat{\mathcal{P}}).
$$

for all other possible M-augmenting paths $\widehat{\mathcal{P}}$ starting at the same column vertex. A matching M<sub>*e*</sub> is **extreme** if and only if there exist *u<sub>i</sub>* and *v<sub>j′</sub>* (which are termed **dual variables**) satisfying

$$\begin{cases} c_{ij'} = u_i + v_{j'}, & \text{if } (i, j') \in \mathcal{M}_e, \\ c_{ij'} \ge u_i + v_{j'}, & \text{otherwise.} \end{cases} \tag{6.8}$$

This is employed by the MC64 algorithm. The dual variables will be important when we discuss scaling sparse matrices in Section 7.4.2. The MC64 algorithm is outlined here as Algorithm 6.4. It starts with a feasible solution and corresponding extreme matching and then iteratively increases the cardinality of the matching by one by constructing a sequence of shortest augmenting paths until a perfect extreme matching is found. The algorithm can be made more efficient if a large initial extreme matching can be found. For example, Step 3 can be replaced by setting *u<sub>i</sub>* = min{*c<sub>ij′</sub>* | *j*′ ∈ S{*A*<sub>*i*,1:*n*</sub>}} for *i* ∈ V<sub>*row*</sub> and *v<sub>j′</sub>* = min{*c<sub>ij′</sub>* − *u<sub>i</sub>* | *i* ∈ S{*A*<sub>1:*n*,*j*</sub>}} for *j*′ ∈ V<sub>*col*</sub>. In Step 4, an initial extreme matching can be determined from the edges for which *c<sub>ij′</sub>* − *u<sub>i</sub>* − *v<sub>j′</sub>* = 0.
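The initialization heuristic described above can be sketched as follows. This is an illustration under stated assumptions, not the MC64 code: the function `initial_duals_and_matching`, the dictionary storage of the costs and the integer cost values are hypothetical, and the initial matching is picked greedily from the edges with zero reduced cost. Integer costs are used so that the test for zero reduced cost is exact.

```python
def initial_duals_and_matching(cost):
    """Initial dual variables and an extreme matching.

    cost[i][j] = c_ij for each structurally nonzero a_ij (0-based indices).
    Returns (u, v, matching) with matching[i] = column matched to row i.
    """
    # u_i = minimum cost in row i.
    u = {i: min(row.values()) for i, row in cost.items()}

    # v_j = minimum reduced cost c_ij - u_i in column j.
    v = {}
    for i, row in cost.items():
        for j, c in row.items():
            v[j] = min(v.get(j, float("inf")), c - u[i])

    # Greedy matching using only edges with c_ij - u_i - v_j = 0.
    matching, used_cols = {}, set()
    for i in sorted(cost):
        for j in sorted(cost[i]):
            if j not in used_cols and cost[i][j] - u[i] - v[j] == 0:
                matching[i] = j
                used_cols.add(j)
                break
    return u, v, matching


# Hypothetical integer costs for a 3 x 3 pattern.
cost = {0: {0: 3, 1: 1},
        1: {0: 2, 1: 5},
        2: {1: 4, 2: 6}}
u, v, matching = initial_duals_and_matching(cost)
print(u, v, matching)
# {0: 1, 1: 2, 2: 4} {0: 0, 1: 0, 2: 2} {0: 1, 1: 0, 2: 2}
```

For these costs the greedy step already yields a perfect matching, and the pair (*u*, *v*) satisfies the conditions (6.8): equality on the matched edges and inequality on all others.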

There are a number of potential problems with the MC64 algorithm. First, the runtime is hard to predict and depends on the initial ordering of *A*. Second, it is a serial algorithm and as such it can represent a significant fraction of the total factorization time of a direct solver. Because the complexity of Step 6 of Algorithm 6.4 is *O*((*n* + *nz*(*A*)) log *n*), the complexity of Step 7 is *O*(*n*) and that of Step 8 is *O*(*n* + *nz*(*A*)), MC64 has a worst-case complexity of *O*(*n*(*n* + *nz*(*A*)) log *n*). In practice, this bound is not achieved and the algorithm is widely used.

#### **ALGORITHM 6.4 Outline of the MC64 algorithm**

**Input:** Matrix *A*.

**Output:** A matching M and dual variables *ui*, *vj* .


solution

4: Set M = {(*i*, *j*′) | *u<sub>i</sub>* + *v<sub>j′</sub>* = *c<sub>ij′</sub>*} ▷ Initial extreme matching


#### *6.3.4 Dulmage-Mendelsohn Decompositions*

The importance of preordering *A* to block triangular form was discussed in Section 3.4. The **Dulmage-Mendelsohn decomposition** is based on matchings and is a generalization of the block triangular form. It provides a precise characterization of structurally rank deficient matrices and it can be used to reduce the work required for an LU factorization. It comprises row and column permutations *P* and *Q* such that

$$
PAQ = \begin{array}{c|ccc}
 & \mathcal{C}_1 & \mathcal{C}_2 & \mathcal{C}_3 \\ \hline
\mathcal{R}_1 & A_1 & A_4 & A_6 \\
\mathcal{R}_2 & 0 & A_2 & A_5 \\
\mathcal{R}_3 & 0 & 0 & A_3
\end{array} . \tag{6.9}
$$

Here *A*<sub>1</sub> is an *m*<sub>1</sub> × *n*<sub>1</sub> underdetermined matrix (*m*<sub>1</sub> < *n*<sub>1</sub> or *m*<sub>1</sub> = *n*<sub>1</sub> = 0), *A*<sub>2</sub> is an *m*<sub>2</sub> × *m*<sub>2</sub> square matrix and *A*<sub>3</sub> is an *m*<sub>3</sub> × *n*<sub>3</sub> overdetermined matrix (*m*<sub>3</sub> > *n*<sub>3</sub> or *m*<sub>3</sub> = *n*<sub>3</sub> = 0). It can be shown that *A*<sub>1</sub><sup>*T*</sup> and *A*<sub>3</sub> are strong Hall matrices but *A*<sub>2</sub> need not be a strong Hall matrix, in which case it can be permuted to block upper triangular form.

If row and column sets R and C form a maximum matching of *A*, then R<sup>1</sup> and R<sup>2</sup> are subsets of R and |R<sup>3</sup> ∩ R| = *n*3, and C<sup>2</sup> and C<sup>3</sup> are subsets of C and |C<sup>1</sup> ∩ C| = *m*1. An example decomposition for a 10 × 10 matrix is given in Figure 6.12. Here R = {1*,* 2*,...,* 9} and C = {2*,* 3*,...,* 10}.

The **coarse Dulmage-Mendelsohn decomposition** orders the unmatched columns as the first columns in *PAQ* and orders the unmatched rows as the last rows in *PAQ*. If *A* is square and has a perfect matching then its coarse decomposition has only the matrix *A*<sub>2</sub>; otherwise, both *A*<sub>1</sub> and *A*<sub>3</sub> are present. The coarse decomposition is computed by first finding a maximum matching. Assuming it is not a perfect matching, the rows in *A*<sub>1</sub> are determined by performing depth-first searches from the unmatched columns to find all of the row vertices that are reachable from the unmatched columns via alternating augmenting paths. The columns in *A*<sub>1</sub> are defined to be the union of the set of unmatched columns and the set of columns matched with the rows in *A*<sub>1</sub>. Similarly, the columns in *A*<sub>3</sub> are determined by performing depth-first searches from the unmatched rows to find all of the column vertices that are reachable from the unmatched rows via alternating augmenting paths. The rows in *A*<sub>3</sub> are defined to be the union of the set of rows matched to the columns in *A*<sub>3</sub> and the set of unmatched rows.

**Figure 6.12** An example of a coarse Dulmage-Mendelsohn decomposition. The blue entries belong to the maximum matching. *m*<sub>1</sub> = 3, *m*<sub>2</sub> = 4, *m*<sub>3</sub> = 3, *n*<sub>1</sub> = 4, *n*<sub>2</sub> = 4, *n*<sub>3</sub> = 2. Column 1 and row 10 are unmatched.
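The two searches that define the coarse decomposition can be sketched in Python as follows. This is a minimal illustration under several assumptions: the function name `coarse_dm`, the 0-based indexing and the 4 × 4 pattern are hypothetical, and the caller is assumed to supply a maximum matching (for example, one computed by an augmenting path method). Each search alternates between unmatched edges and matched edges, as described above.

```python
def coarse_dm(row_cols, n_cols, match_col_of_row):
    """Coarse Dulmage-Mendelsohn row and column sets.

    row_cols[i]         = columns j with a_ij != 0 (0-based)
    match_col_of_row[i] = column matched to row i, or -1 if unmatched;
                          the matching is assumed to be maximum.
    Returns ((R1, R2, R3), (C1, C2, C3)).
    """
    n_rows = len(row_cols)
    match_row_of_col = [-1] * n_cols
    for i, j in enumerate(match_col_of_row):
        if j != -1:
            match_row_of_col[j] = i
    col_rows = [[] for _ in range(n_cols)]
    for i, cols in enumerate(row_cols):
        for j in cols:
            col_rows[j].append(i)

    # A1: alternating search from the unmatched columns.
    c1 = {j for j in range(n_cols) if match_row_of_col[j] == -1}
    r1, stack = set(), list(c1)
    while stack:
        j = stack.pop()
        for i in col_rows[j]:            # any edge from column j to row i
            if i not in r1:
                r1.add(i)
                jm = match_col_of_row[i]  # matched edge from row i
                if jm != -1 and jm not in c1:
                    c1.add(jm)
                    stack.append(jm)

    # A3: alternating search from the unmatched rows.
    r3 = {i for i in range(n_rows) if match_col_of_row[i] == -1}
    c3, stack = set(), list(r3)
    while stack:
        i = stack.pop()
        for j in row_cols[i]:            # any edge from row i to column j
            if j not in c3:
                c3.add(j)
                im = match_row_of_col[j]  # matched edge from column j
                if im != -1 and im not in r3:
                    r3.add(im)
                    stack.append(im)

    r2 = set(range(n_rows)) - r1 - r3
    c2 = set(range(n_cols)) - c1 - c3
    return (r1, r2, r3), (c1, c2, c3)


# Hypothetical structurally singular 4 x 4 pattern.
row_cols = [[0, 1], [2], [3], [3]]
match_col_of_row = [0, 2, 3, -1]   # maximum matching of cardinality 3
rows, cols = coarse_dm(row_cols, 4, match_col_of_row)
print(rows, cols)  # ({0}, {1}, {2, 3}) ({0, 1}, {2}, {3})
```

In this example *A*<sub>1</sub> is 1 × 2 (underdetermined), *A*<sub>2</sub> is 1 × 1 and *A*<sub>3</sub> is 2 × 1 (overdetermined), in agreement with the shapes required by (6.9).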

It may be possible to further permute the matrix to obtain the **fine Dulmage-Mendelsohn decomposition**. The fine Dulmage-Mendelsohn decomposition computes *P* and *Q* such that *A*<sup>1</sup> and *A*<sup>3</sup> are block diagonal matrices in which each diagonal block is irreducible, and *A*<sup>2</sup> is block upper triangular with strongly connected (square) diagonal blocks. Once the coarse decomposition has been computed, *A*<sup>1</sup> and *A*<sup>3</sup> are searched to find any irreducible blocks and the permutations required to place these on the diagonals of *A*<sup>1</sup> and *A*<sup>3</sup> are computed. Finally, following Section 3.4, strongly connected components in *A*<sup>2</sup> are found and a permutation is formed to reduce *A*<sup>2</sup> to block upper triangular form (with the strongly connected components lying on the diagonal). If *A* is reducible and nonsingular, the fine Dulmage-Mendelsohn decomposition can be used to solve the linear system *Ax* = *b* using block back-substitution.

#### **6.4 Notes and References**

Early theoretical results related to sparse LU factorizations can be found in Rose & Tarjan (1978), which extends the systematic understanding of the symbolic elimination introduced in Rose et al. (1976). A key paper that influenced the discussion and development of both the theory and algorithms for predicting sparsity structures in LU factorizations is Gilbert (1994) (first available in 1986 as a Cornell technical report). As the primary and still very useful resource on transitive reduction, we refer to Aho et al. (1972). Gilbert & Liu (1993) extend the concept of an elimination tree to study sparse LU factorizations of nonsymmetric matrices and present theoretical concepts based on DAGs; see also the parallel counterpart in Grigori et al. (2007). Ways to simplify symbolic factorizations and prune DAGs are discussed in Eisenstat & Liu (1992, 1993a). An elegant treatment of both the theoretical and practical aspects of LU factorizations based on DAGs and the nonsymmetric elimination tree (including pruning and pivoting) is given in a series of three papers by Eisenstat & Liu (2005a,b, 2007).

Partial pivoting within the sparse column LU factorization is introduced in Gilbert & Peierls (1988). This paper influenced not only further developments in sparse LU factorizations but also the development of incomplete factorizations. Partial pivoting based on the column elimination tree is first discussed in George & Ng (1985); see also Gilbert & Ng (1993) and Li (1996) for further use of column elimination trees. Further research on exactness of structural predictions is presented by Grigori et al. (2009).

The proof of Theorem 6.17 is given by Berge (1957) but the result was observed earlier (for example, König (1931)). Preordering nonsymmetric matrices using matching algorithms is explained in Duff & Koster (1999, 2001). It is based on the Hungarian algorithm of Kuhn (1955) and a sparse variant of the shortest path algorithm of Dijkstra (1959). Duff and Koster implemented their algorithm in the widely used software package MC64. Because MC64 can be expensive to run, there has been interest in developing efficient parallel algorithms for finding a perfect matching in a weighted bipartite graph (Azad et al., 2020) and also in relaxing the optimality requirement to allow the development of cheaper algorithms that can be parallelised; see, for example, Hogg & Scott (2015). A classical paper that describes the Dulmage-Mendelsohn decomposition is Pothen & Fan (1990).

The development of supernodal LU factorizations is closely connected with that of column LU factorizations. A key paper is by Demmel et al. (1999), in which different types of supernodes for nonsymmetric matrices are considered.

Duff & Reid (1984) describe a symmetric-pattern multifrontal algorithm for nonsymmetric matrices that generates an assembly tree based on the structure of *A* + *A<sup>T</sup>*. This employs square frontal matrices and can incur a substantial overhead for highly nonsymmetric matrices because of unnecessary data dependencies in the assembly tree and extra explicit zeros in the artificially symmetrized frontal matrices. Davis & Duff (1997) introduce a nonsymmetric-pattern multifrontal algorithm that seeks to overcome these deficiencies by using rectangular frontal matrices. This work later developed into the package UMFPACK of Davis (2004), while Amestoy & Puglisi (2002) propose a nonsymmetric version of the multifrontal method that can be regarded as being intermediate between the nonsymmetric-pattern variant of UMFPACK and the symmetric-pattern multifrontal method. The Watson Sparse Matrix Package (WSMP, 2020) also uses a nonsymmetric multifrontal algorithm.

Notable early sparse LU solvers were the Yale Sparse Matrix Package (YSMP) of Eisenstat et al. (1977) and the Harwell Subroutine Library code MA28 written by Duff (1980), followed later by MA48 of Duff & Reid (1996). These codes address important practical considerations (for serial computations). Furthermore, the right-looking Markowitz packages MA28 and MA48, which are designed particularly for highly nonsymmetric matrices, combine the symbolic and numerical factorization phases into a single analyse-factorize phase. Contemporary software packages such as PARDISO (2022), SuperLU (Li et al., 1999), UMFPACK and WSMP have been developed over many years. They provide one of the best ways of understanding the practical value of the ideas presented in research papers and technical reports. PARDISO combines left- and right-looking updates in a parallel shared-memory code that assumes a symmetric nonzero sparsity pattern. SuperLU offers a left-looking supernodal variant for sequential machines, SuperLU\_MT for shared-memory parallel machines, and the right-looking supernodal SuperLU\_DIST (Li & Demmel, 2003) for highly parallel distributed memory hybrid systems. Demmel et al. (1999) and Li (2008) describe the algorithms and performance on various machines. The WSMP software is split into a serial and multithreaded single-process library for use on a single core or multiple cores on a shared-memory machine, and a separate library for distributed memory environments.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 7 Stability, Ill-Conditioning, and Symmetric Indefinite Factorizations**

*Solving sparse symmetric indefinite systems is more problematic. – Ashcraft et al. (1998).*

*The factorization of sparse symmetric indefinite systems is particularly challenging since pivoting is required to maintain stability of the factorization. Pivoting techniques generally offer limited parallelism and are associated with significant data movement hindering the scalability of these methods – Duff et al. (2018).*

Practical computations are invariably based on finite precision arithmetic. Describing the accuracy of such computations often uses the concept of stability. Consider a computational algorithm *z* = *g*(*d*) for computing *z* as a function *g* of given data *d*. The algorithm is said to be **backward stable** if the computed solution $\hat{z}$ is the exact solution of $\hat{z} = g(d + \Delta d)$, where the perturbation $\Delta d$ is "small" for all possible inputs *d*. What is meant by small depends on the context. For example, if *d* is based on physical measurements that are necessarily inaccurate, $\Delta d$ is small if its absolute value is no larger than the inaccuracies in determining *d*. The minimum absolute value $|\Delta d|$ among such perturbations is called the (absolute) **backward error** (or, if divided by |*d*|, the **relative backward error**). To distinguish them from backward errors, the absolute and relative errors of $\hat{z}$ are called **forward errors**. Backward stability is a property of the computational algorithm, and to compute solutions with a small backward error we need to consider stable algorithms.

A related concept that influences the quality of the computed solution is **ill-conditioning**. The problem *z* = *g*(*d*) is said to be ill-conditioned if small perturbations in the data *d* can lead to large changes in the computed $\hat{z}$. The **condition number** measures how sensitive the output of a function is to its input. Ill-conditioning, which is measured in terms of the condition number, is a property of the problem. Provided the backward error, forward error, and the condition number are defined in a consistent manner, the following approximate inequality holds:

$$\text{forward error} \lesssim \text{condition number} \times \text{backward error}.$$

This says that the computed solution to an ill-conditioned problem can have a large forward error because even if the computed solution has a small backward error, this error can be amplified by a large condition number. By preprocessing the problem it may be possible to improve its conditioning. In this chapter, we discuss both the stability of numerical factorizations and preprocessing of the linear system to improve conditioning.
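The amplification effect can be illustrated on a nearly singular 2 × 2 system. The sketch below is only an illustration under stated assumptions: the matrix, the stand-in "computed" solution `x_hat`, the choice of the infinity norm, and the normwise backward error $\|r\|/(\|A\|\,\|\hat{x}\|)$ (perturbing *A* only) are all choices made here, not taken from the text. Exact rational arithmetic is used so that the inequality itself is not blurred by rounding.

```python
from fractions import Fraction as Fr

def norm_vec(v):
    """Infinity norm of a vector."""
    return max(abs(x) for x in v)

def norm_mat(M):
    """Infinity norm of a matrix (largest absolute row sum)."""
    return max(sum(abs(x) for x in row) for row in M)

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Nearly singular matrix (illustrative data, assumed for this sketch).
A = [[Fr(1), Fr(1)], [Fr(1), Fr(10001, 10000)]]
x_true = [Fr(1), Fr(1)]
b = matvec(A, x_true)
x_hat = [Fr(2), Fr(0)]          # stand-in for a computed solution

residual = [bi - yi for bi, yi in zip(b, matvec(A, x_hat))]
cond = norm_mat(A) * norm_mat(inverse_2x2(A))
# Normwise relative backward error when only A is perturbed (assumption).
backward = norm_vec(residual) / (norm_mat(A) * norm_vec(x_hat))
# Forward error measured relative to the computed solution.
forward = norm_vec([x - y for x, y in zip(x_true, x_hat)]) / norm_vec(x_hat)

print(float(backward), float(forward), float(cond))
print(forward <= cond * backward)
```

Here the backward error is tiny while the forward error is one half; the gap is accounted for by a condition number of roughly 4 × 10<sup>4</sup>.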

### **7.1 Backward Stability**

We start with a simple backward error result. Here $\epsilon$ denotes the machine precision.

**Theorem 7.1 (Demmel 1997; Watkins 2002)** *Let the computed LU factorization of a matrix* $A$ *be* $A + \Delta A = \widehat{L}\,\widehat{U}$. *The perturbation* $\Delta A$ *that results from using finite precision arithmetic satisfies*

$$||\Delta A||\_{\infty} \le n \, \mathcal{O}(\epsilon) ||\widehat{L}||\_{\infty} ||\widehat{U}||\_{\infty} + \mathcal{O}(\epsilon^2). \tag{7.1}$$

*Moreover, the computed solution* $\hat{x}$ *of the linear system* $Ax = b$ *satisfies* $(A + \Delta' A)\,\hat{x} = b$ *with*

$$||\Delta'A||\_{\infty} \le n\,\, O(\epsilon)\,||\widehat{L}||\_{\infty}||\widehat{U}||\_{\infty} + O(\epsilon^2). \tag{7.2}$$

At stage *k* of Gaussian elimination, the computed diagonal entry $a\_{kk}^{(k)}$ is termed the **pivot** (1 ≤ *k* < *n*). Gaussian elimination breaks down if a zero pivot is encountered. Provided *A* is nonsingular, row interchanges can be incorporated to prevent this happening (Theorem 1.1). The systematic use of row permutations is called **partial pivoting** and was introduced in Section 3.1.2. If $|a\_{kk}^{(k)}|$ is very small (compared to other entries in the active submatrix), then it can cause difficulties in finite precision arithmetic because the absolute value of the corresponding computed multiplier $l\_{ik} = a\_{ik}^{(k)} / a\_{kk}^{(k)}$ can then be very large. Partial pivoting can be used to ensure $|l\_{ik}| \le 1$, that is, the rows of *A* that have not yet been pivoted on can be permuted so that the new pivot satisfies

$$\max\_{i>k} |a\_{ik}^{(k)}| \le |a\_{kk}^{(k)}|.$$

If $P\_k$ is the row permutation at stage *k* and $P = P\_{n-1}P\_{n-2} \cdots P\_1$, then the computed factors of *PA* satisfy

$$||\hat{L}||\_{\infty} \le n \quad \text{and} \quad ||\hat{U}||\_{\infty} \le n \,\rho\_{growth} ||A||\_{\infty},$$

where the **growth factor** $\rho\_{growth}$ is defined to be

$$\rho\_{growth} = \max\_{i,j,k} \left( |a\_{ij}^{(k)}| \, / \, |a\_{ij}| \right). \tag{7.3}$$

The bounds (7.1) and (7.2) can be rewritten as

$$||\Delta A||\_{\infty} \le n^3 \rho\_{growth} \; O(\epsilon) \; ||A||\_{\infty}, \quad ||\Delta' A||\_{\infty} \le n^3 \rho\_{growth} \; O(\epsilon) \; ||A||\_{\infty}.$$
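The factor $n^3$ can be traced by substituting the bounds on the computed factors into (7.1). As a short sketch of that step (the $O(\epsilon^2)$ term is dropped, and the same substitution applies to (7.2)):

$$||\Delta A||\_{\infty} \le n \, O(\epsilon) \, ||\widehat{L}||\_{\infty} ||\widehat{U}||\_{\infty} \le n \, O(\epsilon) \cdot n \cdot n \, \rho\_{growth} ||A||\_{\infty} = n^3 \rho\_{growth} \, O(\epsilon) \, ||A||\_{\infty}.$$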

In practice, these bounds are pessimistic and the actual errors are typically much smaller. Because backward stability of an LU factorization is influenced both by the initial ordering of *A* and the pivoting strategy, it is said to be **conditionally backward stable**.

For a symmetric positive definite (SPD) matrix *A*, pivoting for stability is not needed. The following states that the Cholesky factorization of *A* is **unconditionally backward stable**, allowing the stable computation of the solution of the corresponding linear system. 

**Theorem 7.2 (Demmel 1997; Watkins 2002)** *Let the computed Cholesky factorization of an SPD matrix* $A$ *be* $A + \Delta A = \widehat{L}\,\widehat{L}^T$. *The perturbation* $\Delta A$ *that results from using finite precision arithmetic satisfies*

$$||\Delta A||\_{\infty} \le n^2 \,\, O(\epsilon) \,||A||\_{\infty}.$$

*Moreover, the computed solution* $\hat{x}$ *of the linear system* $Ax = b$ *satisfies* $(A + \Delta' A)\,\hat{x} = b$ *with*

$$||\Delta'A||\_{\infty} \le n^2 \,\, O(\epsilon) \,||A||\_{\infty}.$$

Both the unconditional backward stability of a Cholesky factorization of an SPD matrix and the conditional backward stability of an LU factorization of a general *A* make algorithms for solving linear systems that are based on factorizing *A* preferable to computing and applying $A^{-1}$. The computed inverse is typically not the exact inverse of a nearby matrix $A + \Delta A$ for any small perturbation $\Delta A$. Furthermore, the following pessimistic result shows it is impractical to compute and store $A^{-1}$, regardless of how sparse *A* is.

**Theorem 7.3 (Duff et al. 1988)** *If A is irreducible, then the sparsity pattern* $\mathcal{S}\{A^{-1}\}$ *of its inverse is fully dense.*

*Proof* Without loss of generality, assume *A* is factorizable. If it is not, there is a permutation matrix *P* such that the row permuted matrix *PA* is factorizable (Theorem 1.1). In this case, consider *PA* instead of *A* because, for any permutation matrix *P*, the inverse $(PA)^{-1}$ is fully dense if and only if $A^{-1}$ is fully dense. Let *K* be the matrix of order 2*n* given by

$$K = \begin{pmatrix} A & I\_n \\ I\_n & 0 \end{pmatrix}.$$

After applying *n* elimination steps to $K = K^{(1)}$, the order *n* active submatrix of $K^{(n+1)}$ is $-A^{-1}$. Consider entry $(A^{-1})\_{ij}$ (1 ≤ *i*, *j* ≤ *n*). Because *A* is irreducible and the off-diagonal (1, 2) and (2, 1) blocks of *K* are equal to the identity matrix, there is a directed path *i* ⇒ *j* in $\mathcal{G}(K)$ such that the indices of all the intermediate vertices on the path are less than or equal to *n*. Theorem 3.1 and the non-cancellation assumption imply $(A^{-1})\_{ij} \neq 0$. It follows that $A^{-1}$ is fully dense.

The above proof implies that the nonzero entries of $A^{-1}$ correspond to paths in $\mathcal{G}(A)$ when *A* is not irreducible. This result is given in the following corollary.

**Corollary 7.4 (Rose & Tarjan 1978; Duff et al. 1988)** *If A is factorizable, then* $(A^{-1})\_{ij} \neq 0$ *(1 ≤ i, j ≤ n) if and only if there exists a path* $i \overset{\mathcal{G}(A)}{\Longrightarrow} j$.
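Both results can be checked on small examples. The following sketch, which is an illustration rather than anything from the cited papers, computes exact inverses with rational arithmetic and compares their nonzero patterns with reachability in $\mathcal{G}(A)$. Two assumptions are made: the helper names are invented here, and the trivial path from a vertex to itself is counted as a path so that diagonal entries are covered.

```python
from fractions import Fraction

def inverse(A):
    """Gauss-Jordan inverse in exact arithmetic, without row interchanges
    (adequate only because the examples below are factorizable)."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for k in range(n):
        piv = M[k][k]
        M[k] = [x / piv for x in M[k]]
        for i in range(n):
            if i != k and M[i][k] != 0:
                f = M[i][k]
                M[i] = [x - f * y for x, y in zip(M[i], M[k])]
    return [row[n:] for row in M]

def reachable(A):
    """reach[i][j] is True if i == j (trivial path, an assumption of this
    sketch) or there is a directed path i => j in G(A)."""
    n = len(A)
    reach = [[i == j or A[i][j] != 0 for j in range(n)] for i in range(n)]
    for k in range(n):                      # Warshall transitive closure
        for i in range(n):
            for j in range(n):
                reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])
    return reach

def pattern(M):
    return [[x != 0 for x in row] for row in M]

# Irreducible tridiagonal matrix: its inverse is fully dense (Theorem 7.3).
T = [[2, -1, 0, 0], [-1, 2, -1, 0], [0, -1, 2, -1], [0, 0, -1, 2]]
# Reducible upper bidiagonal matrix: the inverse is nonzero exactly where a path exists.
U = [[1, 1, 0], [0, 1, 1], [0, 0, 1]]

print(all(all(row) for row in pattern(inverse(T))))
print(pattern(inverse(U)) == reachable(U))
```

Although `T` has only 10 nonzero entries, all 16 entries of its inverse are nonzero, which is the practical reason for never forming $A^{-1}$ explicitly.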

#### **7.2 Pivoting Strategies for Dense Matrices**

This section briefly describes the pivoting strategies that are used in LU factorizations of general dense matrices and, in the symmetric indefinite case, in LDLT factorizations. Here and in the following sections, all the quantities (such as $a\_{ij}^{(k)}$) are the computed quantities.

#### *7.2.1 Partial Pivoting*

Partial pivoting interchanges rows at each stage of the factorization to select the entry of largest absolute value in its column as the next pivot (Section 3.1.2). If partial pivoting is used, it is straightforward to show that the growth factor (7.3) satisfies

$$
\rho\_{\text{growth}} \le 2^{n-1}.
$$

Although the bound can be achieved in nontrivial cases, it is generally extremely pessimistic, particularly when *n* is very large. In practice, Gaussian elimination with partial pivoting is often regarded as being a stable algorithm and is the pivoting strategy of choice for dense matrices.
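A classical matrix that attains the bound has ones on the diagonal and in the last column and −1 everywhere below the diagonal. The sketch below runs Gaussian elimination with partial pivoting on it. Two assumptions are made for the illustration: growth is normalised as the largest entry seen at any stage divided by the largest entry of the original matrix, and ties in the pivot search are resolved in favour of the current row (so no interchanges occur for this matrix).

```python
def growth_partial_pivoting(A):
    """Growth observed during LU with partial pivoting on a nonsingular dense A,
    measured as max_k max|a_ij^(k)| / max|a_ij| (normalisation assumed here)."""
    n = len(A)
    U = [[float(x) for x in row] for row in A]
    max_orig = max(abs(x) for row in U for x in row)
    max_seen = max_orig
    for k in range(n - 1):
        # First entry of largest absolute value in column k of the active submatrix.
        p = max(range(k, n), key=lambda i: abs(U[i][k]))
        U[k], U[p] = U[p], U[k]
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]           # multiplier, |m| <= 1
            for j in range(k, n):
                U[i][j] -= m * U[k][j]
        active = [abs(x) for row in U[k + 1:] for x in row[k + 1:]]
        max_seen = max(max_seen, max(active))
    return max_seen / max_orig

def worst_case_matrix(n):
    """1 on the diagonal and in the last column, -1 below the diagonal."""
    return [[1 if (i == j or j == n - 1) else (-1 if i > j else 0)
             for j in range(n)] for i in range(n)]

for n in (4, 8, 16):
    print(n, growth_partial_pivoting(worst_case_matrix(n)), 2 ** (n - 1))
```

Each elimination step doubles the entries of the last column, so the final pivot equals $2^{n-1}$.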

#### *7.2.2 Complete Pivoting*

A much smaller bound can be obtained if complete (or full) pivoting is used. It chooses the pivot to be the largest entry (in absolute value) in the active submatrix, that is, at stage *k* the pivot $a\_{kk}^{(k)}$ is chosen so that

$$\max\_{i \ge k, j \ge k} |a\_{ij}^{(k)}| \le |a\_{kk}^{(k)}|.$$

In this case,

$$
\rho\_{growth} \le n^{1/2} \left( 2 \cdot 3^{1/2} \cdot 4^{1/3} \cdots n^{1/(n-1)} \right)^{1/2}. \tag{7.4}
$$

The disadvantages of complete pivoting are that it is expensive (the whole active submatrix must be searched for a pivot), and because the test is tougher than for partial pivoting, it is more likely that permutations (and hence more data movement) will be required.

#### *7.2.3 Rook Pivoting*

A pivoting strategy that is more restrictive than partial pivoting but cheaper than complete pivoting is **rook pivoting**. Here the pivot is chosen to be the largest entry in its row *and* its column, that is,

$$\max\_{i>k} \left( |a\_{ik}^{(k)}|, |a\_{ki}^{(k)}| \right) \le |a\_{kk}^{(k)}|.$$

The strategy takes its name from the fact that the search for a pivot corresponds to the moves of a rook in the game of chess. Clearly, the search for a pivot in rook pivoting involves at least twice as many comparisons as for partial pivoting and if the whole active submatrix has to be searched, then the number of comparisons is the same as for complete pivoting. However, in practice, the cost is usually a small multiple of the cost of partial pivoting and significantly less than that of complete pivoting. The growth factor for rook pivoting satisfies

$$
\rho\_{growth} \le 1.5 \, n^{(3/4)\log n}.
$$
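The alternating column and row searches that give rook pivoting its name can be sketched as follows. This is a minimal dense illustration, not a production routine; the function name and the 0-based indexing are choices made here.

```python
def rook_pivot(A, k):
    """Return (i, j) such that A[i][j] is the largest entry in absolute value
    in both row i and column j of the active submatrix A[k:][k:]."""
    n = len(A)
    j = k
    # Start with the largest entry in the first column of the active submatrix.
    i = max(range(k, n), key=lambda r: abs(A[r][j]))
    while True:
        # Move along row i to its largest entry.
        j_best = max(range(k, n), key=lambda c: abs(A[i][c]))
        if abs(A[i][j_best]) <= abs(A[i][j]):
            return i, j                     # already the largest in its row
        j = j_best
        # Move along column j to its largest entry.
        i_best = max(range(k, n), key=lambda r: abs(A[r][j]))
        if abs(A[i_best][j]) <= abs(A[i][j]):
            return i, j                     # already the largest in its column
        i = i_best

A = [[1, 10, 2],
     [3, 4, 20],
     [5, 6, 7]]
print(rook_pivot(A, 0))
```

Every move strictly increases the absolute value of the current candidate, so the search terminates. In this example it visits 5, then 7, then 20, and returns position (1, 2); the entry 10 is never inspected as a candidate, which is why the search is usually far cheaper than complete pivoting.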

## *7.2.4* **2 × 2** *Pivoting*

When the matrix *A* is symmetric but indefinite, it may not be possible to select pivots from the diagonal (for example, if all the diagonal entries of *A* are zero). If rows of *A* are permuted (so that off-diagonal entries are selected as pivots), then symmetry is destroyed, which means an LU factorization must be performed and this essentially doubles the cost of the factorization in terms of both storage and operation counts. Symmetry can be preserved by extending the notion of a pivot to 2 × 2 blocks. 

Consider the symmetric indefinite *A* given by

$$A = \begin{pmatrix} \delta & 1 \\ 1 & 0 \end{pmatrix}.$$

If *δ* = 0, an LDLT factorization in which *D* is a diagonal matrix does not exist. Furthermore, if *δ* ≪ 1, then an LDLT factorization with *D* diagonal is not stable because $\rho\_{growth} = 1/\delta$. However, if the LDLT factorization is generalized to allow *D* to be a block diagonal matrix with 1 × 1 and 2 × 2 blocks, then a factorization is obtained that preserves symmetry and is nearly as stable as an LU factorization. This is illustrated by the factorization of the following 3 × 3 symmetric indefinite matrix

$$A = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} = LDL^T.$$

Here *D* has one 1 × 1 block and one 2 × 2 block.
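Both small examples can be verified directly. The sketch below multiplies out the 3 × 3 block factorization and, for the 2 × 2 matrix, a factorization with diagonal *D* in which entries of size 1/*δ* appear. The value *δ* = 1/1000 and the helper names are assumptions made for the illustration.

```python
from fractions import Fraction

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

# 3 x 3 example: D has one 1 x 1 block and one 2 x 2 block.
A = [[1, 1, 0], [1, 1, 1], [0, 1, 0]]
L = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
D = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
print(matmul(matmul(L, D), transpose(L)) == A)

# 2 x 2 example with a tiny diagonal pivot delta (illustrative value).
delta = Fraction(1, 1000)
A2 = [[delta, 1], [1, 0]]
L2 = [[1, 0], [1 / delta, 1]]              # multiplier of size 1/delta
D2 = [[delta, 0], [0, -1 / delta]]         # second pivot of size 1/delta
print(matmul(matmul(L2, D2), transpose(L2)) == A2)
print(max(abs(x) for row in L2 + D2 for x in row))
```

With diagonal *D*, entries of magnitude 1/*δ* = 1000 are created from a matrix whose entries do not exceed 1, whereas the block factorization of the 3 × 3 matrix involves only entries 0 and 1.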

Rook pivoting can be extended to include 2 × 2 pivots. An iterative procedure searches for an entry that is simultaneously the largest in absolute value in row *i* and column *j* of the active submatrix $A^{(k)}$. This entry is used to build a symmetric 2 × 2 pivot; the search terminates prematurely if a suitable 1 × 1 pivot is found, that is, a pivot that satisfies a threshold test. The standard choice for the threshold comes from requiring the same potential maximal growth in the absolute values of the entries of the partially eliminated matrix that results from either two consecutive 1 × 1 pivots or one 2 × 2 pivot. It can be shown that the appropriate choice is $(1 + \sqrt{17})/8$. In this case, the growth factor satisfies

$$
\rho\_{growth} < 3n \sqrt{2 \cdot 3^{1/2} \cdot 4^{1/3} \cdots n^{1/(n-1)}},
$$

which is only slightly worse than the bound (7.4) for an LU factorization with complete pivoting. Note that the number of partially eliminated matrices depends on the number of 2 × 2 pivots. If a 2 × 2 pivot is selected at stage *k*, then the next partially eliminated matrix is $A^{(k+2)}$.

#### **7.3 Pivoting Strategies for Sparse Matrices**

#### *7.3.1 Threshold Partial Pivoting*

While the growth factor is important, for sparse matrices the pivoting strategies discussed so far lack the scope to preserve sparsity. In the sparse case, it is necessary to balance pivoting for stability with limiting the amount of fill-in in the factors. The compromise strategy that seeks to achieve this is called **threshold partial pivoting**, which is a generalization of partial pivoting. At stage *k* of the numerical factorization phase of a sparse LU solver, the pivot is selected so that after permuting it to the first entry of the active submatrix $A^{(k)}$ it satisfies

$$\max\_{i>k} |a\_{ik}^{(k)}| \le \gamma^{-1} |a\_{kk}^{(k)}|,\tag{7.5}$$

where *γ* ∈ *(*0*,* 1] is a chosen **threshold parameter**. It is straightforward to see that

$$\max\_{i} |a\_{ij}^{(k)}| \le (1 + \gamma^{-1}) \max\_{i} |a\_{ij}^{(k-1)}| \le (1 + \gamma^{-1})^{nz\_j} \max\_{i} |a\_{ij}|,$$

where $nz\_j$ is the number of off-diagonal entries in the *j*-th column of the U factor. Furthermore,

$$
\rho\_{growth} \le (1 + \gamma^{-1})^{nz\_{cmax}},
$$

where $nz\_{cmax} = \max\_j nz\_j \le n - 1$. Choosing *γ* = 1 reduces to partial pivoting; using a smaller value potentially leads to greater growth in the size of the entries in the factors but allows pivots to be chosen that are better able to preserve sparsity. The default choice for *γ* is typically between 0.1 and 0.01, but in some practical applications much smaller values are employed to speed up the factorization (at the possible cost of less accurate factors).
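The freedom that the threshold provides can be sketched for a single pivot column. In the illustration below, every entry that passes test (7.5) is eligible and the one in the sparsest row is chosen; this tie-breaking rule, the dictionary-based data layout, and the function name are assumptions made for the sketch and are not prescribed by the text.

```python
def threshold_pivot_row(col, row_counts, gamma):
    """Choose a pivot row in one column of the active submatrix.

    col        : {row index: value} for the entries of the pivot column
    row_counts : {row index: number of entries in that row} (sparsity measure)
    gamma      : threshold parameter in (0, 1]
    """
    col_max = max(abs(v) for v in col.values())
    # Entries passing the threshold test: |a_ik| >= gamma * max_i |a_ik|.
    eligible = [i for i, v in col.items() if abs(v) >= gamma * col_max]
    # Among eligible rows prefer the sparsest, then the largest entry.
    return min(eligible, key=lambda i: (row_counts[i], -abs(col[i])))

col = {0: 0.2, 3: 1.0, 5: -0.05}
row_counts = {0: 2, 3: 7, 5: 1}

for gamma in (1.0, 0.1, 0.01):
    print(gamma, threshold_pivot_row(col, row_counts, gamma))
```

With *γ* = 1 only the largest entry (row 3, with 7 entries) is eligible, which is partial pivoting. Relaxing to *γ* = 0.1 admits row 0, and *γ* = 0.01 admits the even sparser row 5 at the price of a multiplier of size 20.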

A threshold can also be incorporated into rook pivoting. The pivot must then be at least *γ* times the absolute value of any other entry in its row and column of the active submatrix. Threshold rook pivoting has the potential to limit growth more successfully than threshold partial pivoting. In the symmetric case, if pivots are selected from the diagonal (to preserve symmetry), threshold partial pivoting is the same as threshold rook pivoting.

## *7.3.2 Threshold* **2 × 2** *Pivoting*

If *A* is a symmetric matrix, then standard fill-reducing ordering algorithms (which will be discussed in the next chapter) and the symbolic factorization phase employ only the sparsity pattern of *A*. In general, if *A* is indefinite, during the numerical factorization it is necessary to modify the chosen elimination order to maintain stability. As already observed, if symmetry is to be preserved, 1×1 and 2×2 pivots are needed, resulting in an LDLT factorization in which *D* is a block diagonal matrix with 1 × 1 and 2 × 2 blocks. Limiting the size of the entries of *L* so that

$$|l\_{ij}| \le \gamma^{-1} \tag{7.6}$$

for all *i, j* , together with a backward stable scheme for solving 2×2 linear systems, suffices to show backward stability for the entire solution process.

In the sparse symmetric indefinite case, the stability test for a 1 × 1 pivot in column *t* of the active submatrix at stage *k* is the standard threshold test

$$\max\_{i \neq t, \ i \ge k} |a\_{it}^{(k)}| \le \gamma^{-1} |a\_{tt}^{(k)}|. \tag{7.7}$$

For a 2 × 2 pivot in rows and columns *s* and *t* the corresponding test is

$$\left| \begin{pmatrix} a\_{ss}^{(k)} & a\_{st}^{(k)} \\ a\_{st}^{(k)} & a\_{tt}^{(k)} \end{pmatrix}^{-1} \right| \begin{pmatrix} \max\_{i \neq s, t; \, i \ge k} |a\_{is}^{(k)}| \\\\ \max\_{i \neq s, t; \, i \ge k} |a\_{it}^{(k)}| \end{pmatrix} \le \gamma^{-1} \begin{pmatrix} 1 \\ 1 \end{pmatrix},\tag{7.8}$$

where the absolute value of the matrix is interpreted element-wise. If $a\_{tt}^{(k)}$ is accepted as a 1 × 1 pivot, it becomes the next diagonal entry of *D* and row and column *t* are permuted (if necessary) to the pivotal position *k*. The corresponding diagonal entry of *L* is 1 and from the inequality (7.7), the off-diagonal entries of column *k* of *L* are bounded in absolute value by $\gamma^{-1}$. If

$$\begin{pmatrix} a\_{ss}^{(k)} & a\_{st}^{(k)} \\ a\_{st}^{(k)} & a\_{tt}^{(k)} \end{pmatrix}$$

is accepted as a 2 × 2 pivot, it becomes the next diagonal block of *D* and rows and columns *s* and *t* are permuted (if necessary) to the next two pivotal positions, *k* and *k* + 1. The corresponding diagonal block of *L* is the identity matrix of order 2 and inequality (7.8) ensures that the off-diagonal entries of these columns of *L* are bounded in absolute value by $\gamma^{-1}$.

In addition to bounding the size of the entries in *L*, the ability to stably apply the inverse of *D* to a vector is required. This is trivially the case for 1 × 1 pivots, but for 2 × 2 pivots it is necessary to check that the absolute value of the determinant, $|a\_{ss}^{(k)} a\_{tt}^{(k)} - a\_{st}^{(k)} a\_{st}^{(k)}|$, is sufficiently large and that cancellation does not occur during the application of the inverse.
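The two tests (7.7) and (7.8) can be written as small predicates. The sketch below takes the candidate pivot entries and the largest off-diagonal magnitudes in the candidate columns as inputs. The function names and the absolute determinant tolerance `det_tol` are hypothetical; a practical code would use a scaled test for the determinant.

```python
def passes_1x1(a_tt, max_off_t, gamma):
    """Test (7.7): the largest off-diagonal entry in column t must not exceed
    |a_tt| / gamma. max_off_t is that largest magnitude."""
    return a_tt != 0 and max_off_t <= abs(a_tt) / gamma

def passes_2x2(a_ss, a_st, a_tt, max_off_s, max_off_t, gamma, det_tol=1e-12):
    """Test (7.8) for the symmetric block [[a_ss, a_st], [a_st, a_tt]].
    det_tol is a hypothetical safeguard against a (nearly) singular block."""
    det = a_ss * a_tt - a_st * a_st
    if abs(det) <= det_tol:
        return False
    # Element-wise absolute value of the inverse of the 2 x 2 block.
    inv_abs = [[abs(a_tt / det), abs(a_st / det)],
               [abs(a_st / det), abs(a_ss / det)]]
    bound_s = inv_abs[0][0] * max_off_s + inv_abs[0][1] * max_off_t
    bound_t = inv_abs[1][0] * max_off_s + inv_abs[1][1] * max_off_t
    return bound_s <= 1 / gamma and bound_t <= 1 / gamma

gamma = 0.5
# Zero diagonal entries: no 1 x 1 pivot is possible, but the 2 x 2 block is acceptable.
print(passes_1x1(0.0, 2.0, gamma))
print(passes_2x2(0.0, 1.0, 0.0, 0.5, 2.0, gamma))
# A larger off-diagonal entry in column t makes the same block fail.
print(passes_2x2(0.0, 1.0, 0.0, 0.5, 3.0, gamma))
```

The two quantities `bound_s` and `bound_t` are bounds on the entries of the two columns of *L* that the block would generate, which is why comparing them with $\gamma^{-1}$ enforces (7.6).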

A major difficulty when stability tests are incorporated into sparse factorizations is that a pivot satisfying the stability criteria may not exist. We discuss this for symmetric indefinite *A* but the same problem occurs for general *A*. Consider the supernodal approach of Section 5.3 and the nodal matrix shown in Figure 7.1. Pivots can only be chosen from the block $L\_{diag}$ on the diagonal (the block is square and symmetric and only its lower triangular part is held) but the entries in the off-diagonal block $L\_{rect}$ are involved in the stability tests: large entries in $L\_{rect}$ can cause pivot candidates to fail the threshold tests (7.7), (7.8). If $L\_{diag}$ is of order *p* and only *q* < *p* pivots can be found that satisfy the tests, then *p* − *q* pivots must be **delayed**. That is, the variables that have not been pivoted on are passed up the assembly tree to the parent and the columns of the block column corresponding to these variables are appended to those of the nodal matrix at the parent. The delayed columns are retested at the parent and, if the stability test is still not satisfied, they are passed further up the assembly tree (at the root a full set of *p* pivots can be chosen provided the matrix is nonsingular and *γ* ≤ 0.5).

Observe that to be able to test for large entries, all the off-diagonal entries in a block column must be fully updated before the block on the diagonal is factorized. This means that the **factorize\_block** task and all the **solve\_block** tasks for a block column that are used in the SPD case (Section 5.3) are combined into a single **factorize\_column** task. Thus there are fewer but larger tasks and this reduces the scope for parallelism.


**Figure 7.1** An illustration of a simple nodal matrix. Pivot candidates are restricted to the square block $L\_{diag}$ on the diagonal.

The problem of delayed pivots arises also in the multifrontal method. At each stage of the computation there is a dense symmetric indefinite frontal matrix *F* of order $n\_F$ of the form

$$F = \begin{pmatrix} F\_{11} & F\_{21}^T \\ F\_{21} & F\_{22} \end{pmatrix},\tag{7.9}$$

where $F\_{11}$ is a *p* × *p* matrix corresponding to the fully summed variables. Pivots can only be selected from $F\_{11}$ but the numerical values of the entries in $F\_{21}$ must be taken into account when testing for stability. If *q* < *p* pivots are found, then the partial factorization of *F* is $P\_F F P\_F^T = L\_F D\_F L\_F^T$, where

$$P\_F = \begin{pmatrix} P\_{11} & \\ & I \end{pmatrix}, \quad L\_F = \begin{pmatrix} L\_{11} & \\ L\_{21} & I \end{pmatrix}, \quad D\_F = \begin{pmatrix} D\_1 & \\ & S \end{pmatrix}.$$

Here $P\_F$ is a permutation matrix with $P\_{11}$ of order *p*, $L\_{11}$ is a unit lower triangular matrix of order *q*, $D\_1$ is a block diagonal matrix of order *q*, and *S* is a dense matrix of order $n\_F - q$. A basic procedure for selecting pivots and partially factorizing *F* is summarized in Algorithms 7.1 and 7.2. Here updating means applying the elimination operations. Observe that candidate pivots are only permuted to the start of the frontal matrix once they have been accepted (passed the stability test). Algorithm 7.2 can be modified for a supernodal factorization, replacing the frontal matrix by a supernodal matrix.

So far, we have assumed that *A* is nonsingular, but consistent systems of linear equations with a (nearly) singular matrix can occur in practice and only minor modifications are needed to handle this. When a column is searched, if its largest entry is found to have absolute value less than a chosen threshold *δ*, the column (and, by symmetry, the row) is set to zero, the diagonal entry is accepted as a zero 1 × 1 pivot, and no pivotal update operations are applied to the remaining columns of *F*. This is equivalent to perturbing the entries of *A* in the pivotal column by at most *δ* and the computed factorization is of a nearby singular matrix. It is convenient for the subsequent solve phase to store $D\_F^{-1}$ in place of $D\_F$, with entries on the diagonal corresponding to zero pivots set to zero.

#### **ALGORITHM 7.1 Simple partial sparse indefinite factorization**

**Input:** Symmetric indefinite matrix *F* of order $n\_F$ of the form (7.9) with $F\_{11}$ of order *p*; threshold *γ* ∈ (0, 0.5].

**Output:** Updated *F*; partial factors $L\_F$ and $D\_F$ and permutation $P\_F$.


#### **ALGORITHM 7.2 Find a pivot in** *F* **using threshold partial pivoting**

**Input:** *F*, $L\_F$, $D\_F$, $P\_F$, *p*, *q*, *t*, *γ* are accessed from the environment of the call.

**Output:** Selected pivot of size *piv*\_*size*; computed columns *q* + 1 : *q* + *piv*\_*size* of $L\_F$ and $D\_F$, updated $P\_F$ and *t*.

```
subroutine find_pivot (piv_size)
  piv_size = 0
  for test = 1 : p − q do
    t = t + 1; if (t > p) set t = q + 1        ▷ Column t is searched for a pivot
    if (there is s such that q + 1 ≤ s ≤ t − 1 and
        the block [f_ss f_st; f_st f_tt] passes 2 × 2 pivot test) then
      piv_size = 2
      Symmetrically permute rows/columns q + 1 and s of F        ▷ Update P_F
      Symmetrically permute rows/columns q + 2 and t of F        ▷ Update P_F
      Compute columns q + 1 and q + 2 of D_F and L_F
      return
    else if (f_tt passes 1 × 1 pivot test) then
      piv_size = 1
      Symmetrically permute rows/columns q + 1 and t of F        ▷ Update P_F
      Compute column q + 1 of D_F and L_F
      return
    end if
  end for
end subroutine find_pivot
```

#### *7.3.3 Relaxed and Static Pivoting*

If pivots are delayed during the numerical factorization, then the data structures that were set up during the symbolic phase must be modified. This significantly complicates the development of general and symmetric indefinite sparse direct solvers compared to sparse Cholesky solvers. Furthermore, it increases the operation count and memory required to perform the factorization and, more importantly, it can severely limit the scope for parallelism. Maintaining stability and using static data structures are conflicting objectives.

If no candidate pivot satisfies the threshold test but the pivot that is nearest to satisfying it would satisfy it with a threshold $\gamma\_1 < \gamma$, then provided $\gamma\_1$ is at least some chosen minimum value, **relaxed pivoting** accepts this pivot and reduces *γ* to $\gamma\_1$. The new value $\gamma\_1$ is employed thereafter. This means that the factorization is potentially less stable but, with fewer delayed pivots, the factors may be sparser than if the original *γ* was used throughout.

With relaxed pivoting, delayed pivots can still occur and it may not be possible to use static data structures. Static pivoting allows static data structures because it permits no delayed pivots. When a candidate pivot is found to be too small (and no other eligible candidate passes the stability test), **static pivoting** replaces it by a user-defined value. A small value may make the factorization more accurate but can lead to large growth in the size of the entries in the factors, while a large value controls this growth but reduces the accuracy of the factorization. As well as allowing the use of a static task graph and the structures predicted by the symbolic factorization, other benefits of static pivoting are improved use of BLAS 3 operations and parallelism and, because there is no additional fill-in, load imbalance in a parallel environment is less likely to be a problem. However, the factorization need not be stable and the factors are of a shifted matrix $A + D\_\delta$, where $D\_\delta$ is a diagonal matrix, and it may be necessary to seek to improve the accuracy of the solution using a refinement method (see Section 7.4.1). It is also possible that by the time a very small pivot is found it is too late to save the stability of the factorization and perturbing the pivot effectively just amplifies numerical noise. It is thus essential that static pivoting is used with care; it makes an LDLT or LU direct solver less of a "black box solver" because the guarantees are much weaker than when threshold partial pivoting is used. A more robust approach can be to incorporate the use of shifts into the algorithm that calls the linear system solver. For example, a standard technique in some optimization algorithms that involve symmetric linear systems is to employ regularization. This can avoid the need for an LDLT factorization in favour of a stable Cholesky factorization.

Observe that if an LDLT factorization of a symmetric indefinite matrix *A* is computed, then the **inertia** (that is, the number of positive eigenvalues, negative eigenvalues and eigenvalues equal to zero) of *A* can be found by computing the eigenvalues of the block diagonal factor *D*. In some applications, computing the inertia may be desired. For example, in interior point methods for minimizing a nonlinear objective function subject to constraints, each iteration involves solving a sparse symmetric indefinite linear system and it is important that the solution method for this system accurately reports the inertia to allow parameters within the interior point method to be chosen. One consequence of static pivoting or using a small threshold *γ* is that the computed inertia of *A* is less likely to be accurate.
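Because $A = LDL^T$ is a congruence transformation, Sylvester's law of inertia gives *A* and *D* the same inertia, so only the 1 × 1 and 2 × 2 blocks of *D* need to be examined. The sketch below is one possible way of doing this: rather than computing eigenvalues explicitly, it uses the sign of the determinant and trace of each 2 × 2 block. The representation of the pivots (a number for a 1 × 1 pivot, a tuple `(a, b, c)` for the block with rows (a, b) and (b, c)) is an assumption of the sketch, and exact zero tests are used where a practical code would need tolerances.

```python
def inertia(pivots):
    """Return (positive, negative, zero) eigenvalue counts of a block diagonal D.

    pivots: list whose items are either a number d (1 x 1 pivot) or a tuple
            (a, b, c) standing for the symmetric 2 x 2 pivot [[a, b], [b, c]].
    """
    pos = neg = zero = 0
    for piv in pivots:
        if isinstance(piv, tuple):
            a, b, c = piv
            det, trace = a * c - b * b, a + c
            if det < 0:                     # eigenvalues of opposite sign
                signs = [1, -1]
            elif det > 0:                   # same sign, given by the trace
                signs = [1, 1] if trace > 0 else [-1, -1]
            else:                           # one zero eigenvalue, the other is the trace
                signs = [0, (trace > 0) - (trace < 0)]
        else:
            signs = [(piv > 0) - (piv < 0)]
        pos += signs.count(1)
        neg += signs.count(-1)
        zero += signs.count(0)
    return pos, neg, zero

# D from the 3 x 3 example of Section 7.2.4: one 1 x 1 and one 2 x 2 pivot.
print(inertia([1, (0, 1, 0)]))
```

For that example the result is two positive eigenvalues and one negative eigenvalue, confirming that the matrix is indefinite.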

#### *7.3.4 Special Indefinite Matrices that Avoid Pivoting*

Symmetric saddle point matrices are indefinite matrices of the form

$$A = \begin{pmatrix} G & R^T \\ R & -B \end{pmatrix},\tag{7.10}$$

where $G \in \mathbb{R}^{n\_1 \times n\_1}$ is an SPD matrix, $B \in \mathbb{R}^{n\_2 \times n\_2}$ is a positive semidefinite matrix (including *B* = 0), and $R \in \mathbb{R}^{n\_2 \times n\_1}$ with $n\_1 + n\_2 = n$. Such systems include the class of F matrices, where *B* = 0 and each column of *R* has at most two entries, and if there are two entries, they sum to zero. It is of interest to try to symmetrically permute *A* in such a way that the LDLT factorization of the permuted matrix $PAP^T$ exists without the use of threshold pivoting. This is attractive because it then makes the factorization as efficient as for an SPD matrix.

Define the permutation matrix *P* to be

$$P = \begin{bmatrix} e\_1, \ e\_{n\_1+1}, \ e\_2, \ e\_{n\_1+2}, \ \dots, \ e\_{n\_2}, \ e\_n, \ e\_{n\_2+1}, \ \dots, \ e\_{n\_1} \end{bmatrix}^T.$$

Then the permuted matrix $PAP^T$ has a block form in which each entry $A\_{i,j}$ is a 2 × 2 or 2 × 1 or 1 × 2 or 1 × 1 block. In particular, the diagonal blocks are

$$A\_{i,i} = \begin{cases} \begin{pmatrix} g\_{ii} & r\_{ii} \\ r\_{ii} & -b\_{ii} \end{pmatrix}, & 1 \le i \le n\_2, \\\\ g\_{ii}, & n\_2 + 1 \le i \le n\_1. \end{cases}$$
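The effect of this permutation can be checked on a small example. The sketch below builds the interleaved ordering, applies it to a saddle point matrix with $n\_1 = 4$ and $n\_2 = 2$, and prints the diagonal blocks. The numerical data and helper names are invented for the illustration.

```python
def interleaved_order(n1, n2):
    """1-based indices i of the unit vectors e_i forming the rows of P:
    pairs (i, n1 + i) for i = 1..n2, followed by n2 + 1, ..., n1."""
    order = []
    for i in range(1, n2 + 1):
        order += [i, n1 + i]
    order += range(n2 + 1, n1 + 1)
    return order

def saddle_point_matrix(G, R, B):
    """Assemble [[G, R^T], [R, -B]] as a dense list of lists."""
    n1, n2 = len(G), len(B)
    A = [[0] * (n1 + n2) for _ in range(n1 + n2)]
    for i in range(n1):
        for j in range(n1):
            A[i][j] = G[i][j]
    for i in range(n2):
        for j in range(n1):
            A[n1 + i][j] = A[j][n1 + i] = R[i][j]
        for j in range(n2):
            A[n1 + i][n1 + j] = -B[i][j]
    return A

def permute(A, order):
    """Form P A P^T for the permutation given by the 1-based list order."""
    idx = [i - 1 for i in order]
    return [[A[r][c] for c in idx] for r in idx]

G = [[4, 1, 0, 0], [1, 5, 1, 0], [0, 1, 6, 1], [0, 0, 1, 7]]
R = [[1, 2, 0, 0], [0, 3, 4, 0]]
B = [[8, 0], [0, 9]]

order = interleaved_order(4, 2)
PAPt = permute(saddle_point_matrix(G, R, B), order)
print(order)
print([row[0:2] for row in PAPt[0:2]], [row[2:4] for row in PAPt[2:4]])
print(PAPt[4][4], PAPt[5][5])
```

The ordering is 1, 5, 2, 6, 3, 4. The first two diagonal blocks are the 2 × 2 blocks built from $g\_{ii}$, $r\_{ii}$ and $-b\_{ii}$, and the remaining diagonal blocks are the 1 × 1 entries $g\_{33}$ and $g\_{44}$.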

The following theorem shows that a 2×2 pivot updated by the Schur complement of a 1 × 1 pivot is nonsingular and vice versa.

**Theorem 7.5 (Lungten et al. 2018)** *Let A be the symmetric saddle point matrix (7.10). Assume* $R = (R\_1 \ R\_2)$ *is of full rank with* $R\_1 \in \mathbb{R}^{n\_2 \times n\_2}$ *nonsingular. Let* $G \in \mathbb{R}^{n\_1 \times n\_1}$ *be SPD and partitioned conformally and let* $B \in \mathbb{R}^{n\_2 \times n\_2}$ *be positive semidefinite. If A is permuted to the form*

$$
\left(\begin{array}{cc|c}
G\_{11} & R\_{1}^{T} & G\_{12} \\
R\_{1} & -B & R\_{2} \\
\hline
G\_{12}^{T} & R\_{2}^{T} & G\_{22}
\end{array}\right),
$$

*then the Schur complement of the symmetric indefinite matrix* $\begin{pmatrix} G\_{11} & R\_1^T \\ R\_1 & -B \end{pmatrix}$ *and the Schur complement of the SPD matrix* $G\_{22}$ *are nonsingular.*

A consequence of Theorem 7.5 is that, provided *R* is of full rank and $R\_1$ is nonsingular, the LDLT factorization of $PAP^T$ exists, with 2 × 2 pivots and 1 × 1 pivots chosen from the diagonal blocks of $PAP^T$ in any order. Assume all the 2 × 2 pivots are selected ahead of the 1 × 1 pivots. If *B* = 0 and $|r\_{ii}| \ge \max\_{i \le j \le n\_1} |r\_{ij}|$ ($1 \le i \le n\_2$), then the growth factor is bounded by $2^{2n\_2}$.

A potential difficulty is that permutation matrices $P\_r$ and $P\_c$ are needed such that $P\_r R P\_c = [R\_1 \ R\_2]$ with $R\_1$ nonsingular. If $P\_r$ and $P\_c$ can be constructed so that


$$P\_r R P\_c = \begin{pmatrix} R\_{11} & R\_{12} \\ & R\_{22} \end{pmatrix},\tag{7.11}$$

where $R\_{11}$ is upper triangular with nonzero diagonal entries, then the permuted *R* is said to have a **trapezoidal form**. A simple case where *R* can be permuted to this form is if it satisfies the following one-degree principle. Let *R* be of full rank and let $\mathcal{G}\_b(R) = (\mathcal{V}\_{row}, \mathcal{V}\_{col}, \mathcal{E})$ be the bipartite graph of *R* (Section 6.3.1). *R* can be permuted to trapezoidal form if, for $k = 1, 2, \dots, n\_1 - 1$, the bipartite graph of $R^{(k)}$ has at least one vertex $j'\_k \in \mathcal{V}\_{col}$ of degree one, where $R^{(1)} = R$ and $R^{(k+1)}$ is obtained by removing from $R^{(k)}$ the column vertex $j'\_k$ and its matched row index $i\_k$ together with all edges involving $j'\_k$ or $i\_k$.

To illustrate this, consider the 6 × 8 matrix *R* in Figure 7.2 and its associated bipartite graph $\mathcal{G}\_b(R)$. The first column vertex with degree one is 2′; it is matched with the row vertex 4. Deleting 2′ and 4 removes edges (4, 2′), (4, 3′), (4, 5′), (4, 6′), (4, 8′). Column vertex 3′ now has degree one; it is matched with row vertex 6. Repeating the process gives a perfect matching $\mathcal{M} = \{(4, 2'), (6, 3'), (1, 4'), (5, 5'), (2, 1'), (3, 6')\}$ together with row and column matched vertex sets $\{4, 6, 1, 5, 2, 3\}$ and $\{2', 3', 4', 5', 1', 6'\}$, respectively, and permutation matrices $P\_r$ and $P\_c$ of order 6 and 8 can be defined to obtain the trapezoidal form in Figure 7.2.

If after $k \ge 1$ steps all columns of the reduced matrix $R^{(k)}$ have degree greater than 1, the permuted matrix has the form (7.11) where $R\_{11}$ is $k \times k$ upper triangular, $R\_{12}$ is $k \times (n\_1 - k)$ and the $(n\_2 - k) \times (n\_1 - k)$ block $R\_{22}$ has columns of degree greater than one. Then $n\_1 - k$ steps of Gaussian elimination (with partial pivoting) can be applied to $R\_{22}$ to complete the transformation of *R* to trapezoidal form.
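The one-degree principle can be sketched in Python as follows. The sketch assumes a small dense matrix stored as a list of rows and 0-based indices; the function name `one_degree_trapezoidal` and the 3 × 4 example are hypothetical and are not the matrix of Figure 7.2. When no column of degree one remains, the loop stops and returns the number *k* of completed steps, leaving the block $R\_{22}$ to be handled separately.

```python
def one_degree_trapezoidal(R):
    """Apply the one-degree principle to R (a list of rows).

    Returns (row_order, col_order, k), where the first k rows and
    columns of the permuted matrix form an upper triangular block.
    """
    m, n = len(R), len(R[0])
    active_rows = set(range(m))
    active_cols = set(range(n))
    row_order, col_order = [], []
    while active_rows:
        pick = None
        for j in sorted(active_cols):          # first column of degree one
            rows = [i for i in active_rows if R[i][j] != 0]
            if len(rows) == 1:
                pick = (rows[0], j)
                break
        if pick is None:                       # all remaining degrees > 1
            break
        i, j = pick
        row_order.append(i)
        col_order.append(j)
        active_rows.remove(i)                  # deletes all edges of row i
        active_cols.remove(j)
    k = len(row_order)
    row_order += sorted(active_rows)
    col_order += sorted(active_cols)
    return row_order, col_order, k


# Hypothetical full rank 3 x 4 example.
R = [[0, 1, 1, 0],
     [1, 1, 0, 1],
     [0, 1, 0, 1]]
rows, cols, k = one_degree_trapezoidal(R)
T = [[R[i][j] for j in cols] for i in rows]
print(rows, cols, k)   # prints [1, 0, 2] [0, 2, 1, 3] 3
print(T)               # prints [[1, 0, 1, 1], [0, 1, 1, 0], [0, 0, 1, 1]]
```

Each selected column has its only remaining entry in the matched row, so every row chosen later has a zero in that column; this is what makes the leading block upper triangular.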

**Figure 7.2** Illustration of permuting a full rank matrix to trapezoidal form using the one-degree principle. The matrix *R* and its bipartite graph $\mathcal{G}\_b(R)$ are given. The edges that belong to the perfect matching in $\mathcal{G}\_b(R)$ found using the one-degree principle are given by the dashed blue lines; the corresponding matrix entries are in blue. The trapezoidal form comprises a 6 × 6 upper triangular matrix $R\_1$ and a 6 × 2 rectangular matrix $R\_2$, where $P\_r = [e\_4, e\_6, e\_1, e\_5, e\_2, e\_3]^T$ and $P\_c = [e\_2, e\_3, e\_4, e\_5, e\_1, e\_6, e\_7, e\_8]$ are the row and column permutation matrices.

#### **7.4 Solving Ill-Conditioned Problems**

Ill-conditioning is connected to the input data: a problem is ill-conditioned if small changes in the data can lead to large changes in the solution. Assume for the general linear system *Ax* = *b* that *A* and *b* are perturbed by Δ*A* and Δ*b*, respectively, and the corresponding perturbation of the solution *x* is Δ*x*, so that the perturbed problem

$$(A + \Delta A)(\mathbf{x} + \Delta \mathbf{x}) = b + \Delta b \tag{7.12}$$

has been solved. The perturbations in *A* and *b* may include both data uncertainty and algorithmic errors. Rearranging (7.12), we obtain

$$A\Delta \mathbf{x} = \Delta b - \Delta A\, \mathbf{x} - \Delta A \Delta \mathbf{x}.$$

Premultiplying by $A^{-1}$ and considering *any* norm $\|\cdot\|$ and the corresponding subordinate matrix norm yields

$$\|\Delta x\| \le \|A^{-1}\| \left( \|\Delta b\| + \|\Delta A\| \|x\| + \|\Delta A\| \|\Delta x\| \right).$$

It follows that

$$(1 - \|A^{-1}\| \|\Delta A\|) \|\Delta x\| \le \|A^{-1}\| \left(\|\Delta b\| + \|\Delta A\| \|x\|\right)$$

and, provided $\|A^{-1}\| \|\Delta A\| < 1$, this gives the following bound on the absolute error

$$\|\Delta x\| \le \frac{\|A^{-1}\|}{1 - \|A^{-1}\| \|\Delta A\|} (\|\Delta b\| + \|\Delta A\| \|x\|).$$

Dividing by $\|x\|$ and using $\|b\| \le \|A\| \|x\|$ yields the relative error bound

$$\|\Delta\mathbf{x}\|/\|\mathbf{x}\| \le \frac{\kappa(A)}{1 - \kappa(A) \|\Delta A\|/\|A\|} \left( \|\Delta A\|/\|A\| + \|\Delta b\|/\|b\| \right),\tag{7.13}$$

where

$$\kappa(A) = \|A\| \, \|A^{-1}\|\tag{7.14}$$

is the **condition number** of the matrix *A*. The inequality (7.13) shows that the condition number is a relative error magnification factor. If we have a stable algorithm, then a neighbouring problem has been solved, that is,

$$\|\Delta A\| / \|A\| + \|\Delta b\| / \|b\| $$

is small. This ensures an accurate solution if *κ(A)* is small. A large condition number means that *A* is close to being singular (*κ(A)* tends to infinity as *A* tends to singularity).

Observe that the condition number is very dependent on the scaling of *A*. Furthermore, *κ(A)* takes no account of the right-hand side vector *b* or the fact that small entries of *A* (including zeros) may be known within much smaller tolerances than larger entries.
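A small numerical check of (7.13) can make the role of *κ(A)* concrete. The Python sketch below uses the infinity norm, a made-up nearly singular 2 × 2 matrix and a perturbation of *b* only (so Δ*A* = 0 and the bound reduces to *κ(A)* ‖Δ*b*‖/‖*b*‖); all names and values are illustrative assumptions.

```python
def norm_inf_mat(M):
    """Subordinate infinity norm: maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in M)


def norm_inf_vec(v):
    return max(abs(t) for t in v)


def inverse_2x2(M):
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]


def matvec(M, v):
    return [sum(a * t for a, t in zip(row, v)) for row in M]


# Hypothetical ill-conditioned system with solution close to (1, 1).
A = [[1.0, 1.0], [1.0, 1.0001]]
b = [2.0, 2.0001]
db = [0.0, 1.0e-4]                       # perturbation of the right-hand side

A_inv = inverse_2x2(A)
x = matvec(A_inv, b)
x_pert = matvec(A_inv, [bi + di for bi, di in zip(b, db)])
dx = [p - t for p, t in zip(x_pert, x)]

kappa = norm_inf_mat(A) * norm_inf_mat(A_inv)          # definition (7.14)
rel_err = norm_inf_vec(dx) / norm_inf_vec(x)
bound = kappa * norm_inf_vec(db) / norm_inf_vec(b)     # (7.13) with zero ΔA
print(kappa, rel_err, bound)
```

Here *κ(A)* is roughly 4 × 10<sup>4</sup>, so a relative change in *b* of about 5 × 10<sup>−5</sup> is allowed to produce a relative change in *x* of about 2; the observed relative change is close to 1, within the bound.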

If the matrix norm is that induced by the Euclidean norm (that is, the 2-norm $\|\cdot\|\_2$) and *A* is symmetric, then (7.14) becomes

$$\kappa(A) = |\lambda\_{\text{max}}(A)| / |\lambda\_{\text{min}}(A)|,\tag{7.15}$$

where $\lambda\_{\max}(A)$ and $\lambda\_{\min}(A)$ are eigenvalues of *A* of largest and smallest absolute values, respectively. This is called the **spectral condition number** of *A*. It is important when considering convergence of iterative solvers (Section 9.1.2).

**ALGORITHM 7.3 Iterative refinement of the computed solution of** *Ax* = *b*

**Input:** The vector *b* and matrix *A*. **Output:** A sequence of approximate solutions $x^{(0)}, x^{(1)}, \dots$.

1: Solve $Ax^{(0)} = b$ ▷ $x^{(0)}$ is the initial computed solution

2: **for** $k = 0, 1, \dots$ **do**

3: Compute $r^{(k)} = b - Ax^{(k)}$ ▷ Residual on iteration *k*

4: Solve $A\,\delta x^{(k)} = r^{(k)}$ ▷ Solve correction equation

5: $x^{(k+1)} = x^{(k)} + \delta x^{(k)}$

6: **end for**

#### *7.4.1 Iterative Refinement*

Iterative refinement can be used to overcome matrix ill-conditioning and improve the accuracy of the computed solution. It may also be used after relaxed or static pivoting. The basic method is outlined as Algorithm 7.3. Note that the solvers in Steps 1 and 4 do not have to be the same. The traditional and most common approach is to use the computed factors of *A* in both steps. Alternatively, the factors can be employed as a preconditioner for an iterative solver in Step 4 (preconditioning and iterative solvers are discussed in Chapter 9). Iterative refinement terminates when either the norm of the residual vector $r^{(k)}$ is sufficiently close to zero that the corresponding correction $\delta x^{(k)}$ is very small or the chosen maximum number of iterations is reached. If there were no roundoff errors in any of the refinement steps, the process would converge to the correct solution in a single iteration. In practice, the residual generally decreases significantly over the first few iterations before stagnating (i.e. reaching a point after which little further accuracy is achieved). If the required accuracy has not been achieved, then a possible approach is to switch to using the computed factors as a preconditioner for a Krylov subspace solver (see Chapter 9).

Observe that computing $r^{(k)}$ in Step 3 uses the original matrix *A* and if the residual is small, a nearby problem will have been solved. This is particularly useful when there is uncertainty in the accuracy of the computed factors as an approximation to *A* (for instance, if threshold pivoting or static pivoting has been employed).

There are a number of variants of iterative refinement that involve using different precisions for all or part of the process. In traditional iterative refinement, the residuals are computed at twice the working precision (the precision at which the data *A*, *b* and the solution *x* are stored). In fixed precision refinement, all computations use the same precision. In mixed precision iterative refinement, the most expensive parts of the computation (the LU factorization of *A* and solving the correction equation) are performed in single precision and the residual computation in double precision. This is attractive because on modern computer architectures single precision arithmetic is usually significantly faster than double precision. Moreover, holding the factors in single precision substantially reduces the memory required and the amount of data movement. The use of half precision (16-bit) arithmetic is also a possibility, assuming it is considerably faster than single precision, with a proportional saving in energy consumption.
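The structure of Algorithm 7.3 can be sketched in plain Python. In this fixed precision sketch the "computed factors" are imitated by solving with a slightly perturbed copy of *A* (a stand-in for the effect of static pivoting or low precision factors), while the residual always uses the original *A*. The dense solver `gauss_solve`, the perturbation size and the stopping rule on the residual norm are assumptions made for illustration.

```python
def gauss_solve(M, rhs):
    """Dense Gaussian elimination with partial pivoting (small systems only)."""
    n = len(M)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[p] = aug[p], aug[k]
        for i in range(k + 1, n):
            f = aug[i][k] / aug[k][k]
            for j in range(k, n + 1):
                aug[i][j] -= f * aug[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def iterative_refinement(A, b, approx_solve, tol=1e-12, max_iter=20):
    """Refine x using residuals of the original A and an approximate solver."""
    x = approx_solve(b)                                     # Step 1
    history = []                                            # residual norms
    for _ in range(max_iter):
        r = [bi - sum(a * t for a, t in zip(row, x))        # Step 3
             for row, bi in zip(A, b)]
        history.append(max(abs(ri) for ri in r))
        if history[-1] <= tol:
            break
        dx = approx_solve(r)                                # Step 4
        x = [xi + di for xi, di in zip(x, dx)]              # Step 5
    return x, history


# Hypothetical data: exact solution (1, 2, 3); factors of a perturbed matrix.
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [6.0, 10.0, 8.0]
A_pert = [[a * (1.001 if i == j else 1.0) for j, a in enumerate(row)]
          for i, row in enumerate(A)]
x, history = iterative_refinement(A, b, lambda rhs: gauss_solve(A_pert, rhs))
print(x, len(history))
```

Because the correction equation is solved only approximately, each pass reduces the error by a roughly constant factor rather than removing it in one step; the recorded residual norms in `history` show this decrease.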

#### *7.4.2 Scaling to Reduce Ill-Conditioning*

We have discussed the importance of the condition number *κ(A)*. If it is large, then we would like to reduce it by transforming *A*. An important way of doing this is by scaling *A* before the numerical factorization begins.

Consider two nonsingular *n* × *n* diagonal matrices *Sr* and *Sc*. Diagonal scaling of the system *Ax* = *b* transforms it to

$$S\_r \, A \, S\_c \, \mathbf{y} = S\_r \, b, \qquad \mathbf{y} = S\_c^{-1} \, \mathbf{x}.\tag{7.16}$$

If *A* is symmetric, then selecting *Sr* = *Sc* retains symmetry. For a general *A*, scaling and permuting to bring large entries onto the diagonal can reduce the need for numerical pivoting, resulting in fewer delayed pivots, less fill-in, faster factorization and solve times, and a reduction in the storage requirements. But finding a good scaling can represent a significant overhead (especially within a parallel solver) and there are limits on the reduction in *κ(A)* that can be achieved by scaling, as illustrated by the following result.

**Theorem 7.6 (van der Sluis 1969)** *Let the matrix A be SPD and let* $D\_A$ *be the diagonal matrix with entries* $a\_{ii}$ *(*1 ≤ *i* ≤ *n). Then for all diagonal matrices D with positive entries*

$$
\kappa \left( D\_A^{-1/2} A \, D\_A^{-1/2} \right) \le nz\_{rmax} \, \kappa \left( D^{-1/2} A \, D^{-1/2} \right),
$$

*where* $nz\_{rmax}$ *is the maximum number of entries in a row of A.*

We remark that the original (unscaled) matrix *A* should be retained for iterative refinement of the computed solution. Using the scaled matrix generally results in a larger residual for the original system because, in effect, a perturbed system is solved.

#### **Equilibration Scaling**

How to find an appropriate scaling is an open question, but a number of heuristics have been proposed. An obvious choice is to seek to balance entries of the scaled matrix *SrASc* to have approximately equal absolute values. This is called (approximate) **equilibration** scaling. It is a natural scaling if the numerical values of the entries of *A* correspond to physical quantities that are measured using different scales.

One approach to equilibration scaling that is relatively cheap as well as easy to implement is to select the diagonal scaling matrices so that the infinity norm of each row and column of the scaled matrix is approximately equal to unity. Algorithm 7.4 presents an iterative procedure for computing such a scaling. Observe that this preserves symmetry. In the nonsymmetric case, Algorithm 7.4 yields the same results when applied to *A* and *AT* in the sense that the scaled matrix obtained for *AT* is the transpose of that for *A*.

The infinity norm in Algorithm 7.4 may be replaced by the 1-norm, resulting in a matrix whose row and column sums are exactly one (this is sometimes called a doubly stochastic matrix). It can be advantageous to combine the use of the infinity and one norms, for example, by performing one step of infinity norm scaling followed by one or more steps of one norm scaling.

#### **ALGORITHM 7.4 Equilibration scaling in the infinity norm**

**Input:** The matrix *A* and convergence tolerance *δ >* 0. **Output:** Diagonal scaling matrices *Sr* and *Sc*.

1: $B^{(1)} = A, \; D^{(1)} = I, \; E^{(1)} = I$

2: **for** $k = 1, 2, \ldots$ **do**
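As a rough illustration of this kind of iteration, the following Python sketch scales every row and column by the inverse square root of its current infinity norm until all norms are within *δ* of one. This is one common form of infinity norm equilibration, written here as an assumption; it is not a transcription of Algorithm 7.4, and the function name `equilibrate_inf` and the test matrix are hypothetical. The sketch also assumes that *A* has no zero rows or columns.

```python
from math import sqrt


def equilibrate_inf(A, delta=1e-8, max_iter=100):
    """Return (r, c, B) with B = diag(r) * A * diag(c) approximately equilibrated."""
    m, n = len(A), len(A[0])
    B = [row[:] for row in A]
    r = [1.0] * m                      # accumulated row scaling
    c = [1.0] * n                      # accumulated column scaling
    for _ in range(max_iter):
        row_norm = [max(abs(v) for v in row) for row in B]
        col_norm = [max(abs(B[i][j]) for i in range(m)) for j in range(n)]
        if (max(abs(1.0 - t) for t in row_norm) <= delta and
                max(abs(1.0 - t) for t in col_norm) <= delta):
            break
        dr = [1.0 / sqrt(t) for t in row_norm]
        dc = [1.0 / sqrt(t) for t in col_norm]
        B = [[dr[i] * B[i][j] * dc[j] for j in range(n)] for i in range(m)]
        r = [ri * di for ri, di in zip(r, dr)]
        c = [cj * dj for cj, dj in zip(c, dc)]
    return r, c, B


# Hypothetical badly scaled symmetric matrix.
A = [[1.0e4, 2.0], [2.0, 1.0e-2]]
r, c, B = equilibrate_inf(A)
print(B)
```

For a symmetric input the row and column norms coincide at every step, so the same scaling is applied on both sides, which is how symmetry is preserved.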



#### **Matching-Based Scalings**

In Section 6.3.3, we discussed weighted matchings and, in particular, the problem of finding a permutation vector *q* that maximizes the product

$$\prod\_{i=1}^{n} |a\_{iq\_i}|.$$

The entries $a\_{iq\_i}$ corresponding to the solution *q* are the matched entries. The dual variables $u\_i$ and $v\_j$ computed by the MC64 algorithm (Algorithm 6.4) that seeks to compute *q* can be used to calculate a scaling as follows. Define the diagonal scaling matrices $S\_r$ and $S\_c$ to have entries

$$(S\_r)\_{ii} = \exp(u\_i) \quad \text{and} \quad (S\_c)\_{jj} = \exp(v\_j - \log(\max\_i |a\_{ij}|)), \quad 1 \le i, \ j \le n.$$

The entries of the scaled matrix *SrASc* satisfy

$$\left| (S\_r A S\_c)\_{ij} \right| \begin{cases} = 1, & \text{if } (i, j) \in \mathcal{M}, \\ \le 1, & \text{otherwise}, \end{cases}$$

where $\mathcal{M}$ is the maximum weighted matching computed by the MC64 algorithm. If *A* is symmetric, let *S* be the diagonal matrix with entries

$$(S)\_{ii} = \sqrt{(S\_r)\_{ii}(S\_c)\_{ii}}.$$

Then the symmetric matrix *SAS* has the same property.

#### **Combining Matching-Based Scalings and Orderings**

The matching-based ordering and scaling can be used independently but they can also be combined. After scaling, if the matched entries are non-symmetrically permuted onto the diagonal, then because they are large, they provide good pivot candidates for an LU factorization. This approach is commonly used alongside static pivoting to obtain a factorization of a perturbed matrix, followed by iterative refinement to recover the solution to the original system.

In the symmetric indefinite case, symmetry needs to be maintained and so the objective is to symmetrically permute a large off-diagonal entry $a\_{ij}$ onto the subdiagonal to give a 2 × 2 block $\begin{pmatrix} a\_{ii} & a\_{ij} \\ a\_{ij} & a\_{jj} \end{pmatrix}$ that is potentially a good 2 × 2 candidate pivot. Assume that a matching $\mathcal{M}$ has been computed using the MC64 algorithm and let *q* be the corresponding permutation vector. Any diagonal entries that are in the matching are immediately considered as potential 1 × 1 pivots and are held in a set $\mathcal{M}\_1$. A set $\mathcal{M}\_2$ of potential 2 × 2 pivots is then built by expressing *q* in terms of its component cycles. A cycle of length 1 corresponds to an entry $a\_{ii}$ in the matching. A cycle of length 2 corresponds to two vertices *i* and *j*, where $a\_{ij}$ and $a\_{ji}$ are both in the matching. In general, *k* potential 2 × 2 pivots can be extracted from an even cycle of length 2*k* or from an odd cycle of length 2*k* + 1. A straightforward way to do this is to take the first two entries as the first 2 × 2 pivot, the next two as the next 2 × 2 pivot, and so on, until, if the cycle is of odd length, a single entry remains, which is added to the set $\mathcal{M}\_1$. In practice, most cycles in *q* are of length 1 or 2. A simple example is given in Figure 7.3. Here the matching from MC64 is $\mathcal{M} = \{(1, 2), (2, 5), (3, 1), (4, 4), (5, 3)\}$, which is nonsymmetric. *q* has one cycle of length 4 (1 → 2 → 5 → 3 → 1) and one of length 1, giving $\mathcal{M}\_1 = \{(4, 4)\}$ and $\mathcal{M}\_2 = \{(1, 2), (2, 1), (3, 5), (5, 3)\}$.

**Figure 7.3** An illustration of a symmetric matching for a symmetric indefinite matrix. On the left is the matching $\mathcal{M}$ returned by the MC64 algorithm and in the centre is a symmetric matching $\mathcal{M}\_s$ obtained from $\mathcal{M}$. Entries in the matching are in blue. The pairs (*i*, *j*) = (1, 2) and (3, 5) are possible 2 × 2 pivot candidates. On the right is the compressed matrix that results from combining rows and columns 1 and 2 and rows and columns 3 and 5.
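The cycle splitting just described can be sketched in Python. The sketch assumes that the matching is supplied as a 0-based permutation list `q`, where `q[i]` is the column matched to row `i`; the function name `symmetric_matching` is hypothetical. The example is the matching of Figure 7.3 shifted to 0-based indices.

```python
def symmetric_matching(q):
    """Split the cycles of the permutation q into 1x1 and 2x2 pivot candidates.

    Returns (singles, pairs): indices of potential 1x1 pivots and index
    pairs (i, j) of potential 2x2 pivots.
    """
    n = len(q)
    seen = [False] * n
    singles, pairs = [], []
    for start in range(n):
        if seen[start]:
            continue
        cycle = []
        i = start
        while not seen[i]:                 # walk the cycle containing start
            seen[i] = True
            cycle.append(i)
            i = q[i]
        for t in range(0, len(cycle) - 1, 2):
            pairs.append((cycle[t], cycle[t + 1]))   # consecutive entries
        if len(cycle) % 2 == 1:
            singles.append(cycle[-1])      # leftover entry of an odd cycle
    return singles, pairs


# Matching {(1,2), (2,5), (3,1), (4,4), (5,3)} written 0-based.
q = [1, 4, 0, 3, 2]
singles, pairs = symmetric_matching(q)
print([i + 1 for i in singles])                # prints [4]
print([(i + 1, j + 1) for i, j in pairs])      # prints [(1, 2), (5, 3)]
```

Each pair (*i*, *j*) produced this way satisfies `q[i] == j`, so the off-diagonal entry of the candidate 2 × 2 pivot is a matched entry.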

Let $\mathcal{M}\_s = \mathcal{M}\_1 \cup \mathcal{M}\_2$ be the resulting symmetric matching obtained from $\mathcal{M}$ and let $Q\_s$ be the corresponding permutation matrix. To combine $Q\_s$ with a fill-reducing ordering (such as nested dissection or minimum degree), $Q\_s A Q\_s^T$ is compressed. The union of the sparsity structure of the two rows and columns belonging to a potential 2 × 2 pivot is built and used as the structure of a single row and column in the compressed matrix. A fill-reducing ordering algorithm is then applied to the (weighted) compressed graph, and the computed permutation is expanded to a permutation $Q\_f$ for $Q\_s A Q\_s^T$. The final permutation matrix is the product $Q\_f Q\_s$. The rows/columns of a potential 2 × 2 pivot are ordered consecutively.

This approach can reduce the overall computational cost when solving tough indefinite systems for which non-matching based orderings require substantial modifications to the pivot sequence during the numerical factorization to maintain stability. Unfortunately, although after applying the matching-based scaling and ordering there are pivot candidates with large entries, there is still no guarantee that the computed pivot sequence will not need modifying during the factorization. An important disadvantage of using matchings is that the numerical values of the entries of *A* are used so that, if a series of matrices with the same sparsity pattern but different numerical values needs to be factorized (such as occurs when an iterative method is used to solve a nonlinear system), the whole symbolic factorization phase may have to be rerun for each matrix, potentially adding significantly to the total solution time.

#### **7.5 Notes and References**

There are many related but different results on the stability of matrix factorizations. While the seminal book of Higham (2002) includes component-wise accuracy and stability analysis (see also the classical text (Wilkinson, 1961), which introduced the terms partial pivoting and complete pivoting), the norm-wise results given in Section 7.1 are based on Demmel (1997); see also Watkins (2002).

Rook pivoting is introduced in Neal & Poole (1992) and analysed in Foster (1997). Early pivoting strategies for dense symmetric indefinite systems are presented in Bunch & Parlett (1971), Bunch (1971), and Bunch & Kaufman (1977). Static pivoting in sparse LU factorizations was first proposed by Li & Demmel (1998). A comprehensive overview of threshold-based pivoting strategies for dense and sparse symmetric indefinite problems is given in Ashcraft et al. (1998). This includes symmetric rook pivoting for dense problems and a discussion of the sparse 2 × 2 threshold partial pivoting strategy of Duff & Reid (1983), which was subsequently modified in Duff et al. (1991), and forms the basis of the approach of Section 7.3.2. Further implementation details (including incorporating working with blocks) are found in Reid & Scott (2011) (see also Hogg & Scott, 2013c). More recently, there has been work on new strategies that seek to offer greater potential for exploiting parallelism without sacrificing numerical robustness, including Hogg & Scott (2014), Hogg et al. (2016), and Duff et al. (2018).

Avoiding the need to pivot for special classes of indefinite matrices is from Lungten et al. (2018) (but see also Tůma, 2002 and de Niet & Wubs, 2009). Duff & Pralet (2005) and Schenk & Gärtner (2006) use weighted matchings for preprocessing, the latter implementing their strategy within the initial version of the solver PARDISO. The HSL mathematical software library (HSL, 2022) includes a number of packages that are designed for symmetric indefinite systems, most notably the multifrontal codes MA57 (Duff, 2004) and HSL\_MA97, and the supernodal DAG-based code HSL\_MA86 (Hogg & Scott, 2013b). In these solvers, the default setting for the threshold pivoting parameter *γ* is 0.01, although when used within the well-known interior point solver (IPOPT, 2022), a value of $10^{-8}$ is recommended (see also Saunders, 1996). Other well-known sparse direct solvers that handle symmetric indefinite systems include MUMPS (2022) and WSMP (2020).

The technique of iterative refinement was introduced by Doolittle (1878). It was probably first used in a computer program for improving the computed solution to a linear system by Wilkinson (1948), during the design and building of the ACE computer at the National Physical Laboratory; see also Wilkinson (1963) and Moler (1967). The book by Higham (2002) is an essential reference. For sparse systems, the paper by Arioli et al. (1989) is of interest. Hogg & Scott (2010) employ iterative refinement within a sparse mixed precision multifrontal solver. More recently, with a focus on dense systems, Carson & Higham (2017, 2018) and Carson et al. (2020) propose an alternative form of mixed precision iterative refinement that is able to handle highly ill-conditioned problems by solving for the correction using the GMRES iterative method preconditioned by the computed LU factors. The survey by Abdelfattah et al. (2021) provides a comprehensive review of work on the use of mixed precision in numerical linear algebra.

For Theorem 7.6, we refer to van der Sluis (1969). The equilibration scaling in the infinity norm that is outlined in Algorithm 7.4 is given by Ruiz (2001) (see also Liu, 2015). Matching-based scalings are presented in Duff & Koster (1999, 2001), but see also Neumaier & Olschowka (1996) as well as the origins of the scaling factors in Edmonds (1965).

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 8 Sparse Matrix Ordering Algorithms**

*The computational complexity of obtaining optimal reorderings for performing sparse Gaussian elimination justifies the heuristic nature of all practical reordering algorithms. – Erisman et al. (1987).*

So far, our focus has been on the theoretical and algorithmic principles involved in sparse Gaussian elimination-based factorizations. To limit the storage and the work involved in the computation of the factors and in their use during the solve phase it is generally necessary to reorder (permute) the matrix before the factorization commences. The complexity of the most critical steps in the factorization is highly dependent on the amount of fill-in, as can be seen from the following observation.

**Observation 8.1** *The operations to perform the sparse LU factorization A* = *LU and the sparse Cholesky factorization* $A = LL^T$ *are* $O\left(\sum\_{j=1}^{n} |col\_L\{j\}| \, |row\_U\{j\}|\right)$ *and* $O\left(\sum\_{j=1}^{n} |col\_L\{j\}|^2\right)$*, respectively, where* $|row\_U\{j\}|$ *and* $|col\_L\{j\}|$ *are the number of off-diagonal entries in row j of U and column j of L, respectively.*

The problem of finding a permutation to minimize fill-in is NP-complete and thus heuristics are used to determine orderings that limit the amount of fill-in; we refer to these as fill-reducing orderings. Frequently, this is done using the sparsity pattern S{*A*} alone, although sometimes for non-definite matrices, it is combined with the numerical factorization because additional permutations of *A* may be needed to make the matrix factorizable. Two main classes of methods that work with S{*A*} are commonly used.


**It is assumed throughout this chapter that** *A* **is irreducible.** Otherwise, if S{*A*} is symmetric, the algorithms are applied to each component of G*(A)* independently and *n* is then the number of vertices in the component. If S{*A*} is nonsymmetric, we assume that *A* is in block triangular form and the algorithms are used on the graph of each block on the diagonal. We also assume that *A* has **no rows or columns that are (almost) dense**. If it does, a simple strategy is to remove them before applying the ordering algorithm to the remaining matrix; the variables corresponding to the dense rows and columns can be appended to the end of the computed ordering to give the final ordering.

Historically, ordering the matrix *A* before using a direct solver to factorize it was generally cheap compared to the numerical factorization cost. However, in the last couple of decades, the development of more sophisticated factorization algorithms and their implementations in parallel on modern architectures has affected this balance so that the ordering can be the most expensive step. If a sequence of matrices having the same sparsity pattern is to be factorized, then the ordering cost and the cost of the symbolic factorization can be amortized over the numerical factorizations. If not, it is important to have available a range of ordering algorithms because using a cheap but less effective algorithm may lead to faster complete solution times compared to using an expensive approach that gives some savings in the memory requirements and operation counts but not enough to offset the ordering cost.

## **8.1 Local Fill-Reducing Orderings for Symmetric S{*A*}**

In the symmetric case, the diagonal entries of *A* are required to be present in S{*A*} (thus zeros on the diagonal are included in the sparsity structure). The aim is to limit fill-in in the L factor of an *LL<sup>T</sup>* (or *LDL<sup>T</sup>* ) factorization of *A*. Two greedy heuristics are the minimum degree (MD) criterion and the local minimum fill (MF) criterion.

#### *8.1.1 Minimum Fill-in (MF) Criterion*

One way to reduce fill-in is to use a local **minimum fill-in** (MF) criterion that, at each step, selects as the next variable in the ordering one that will introduce the least fill-in in the factor at that step. This is sometimes called the **minimum deficiency** approach. While MF can produce good orderings, its cost is often considered to be prohibitive because the updated sparsity pattern is required and the fill-in associated with each possible candidate must be determined. The runtime can be reduced using an approximate variant (AMF) but it is not widely implemented in modern sparse direct solvers.

#### *8.1.2 Basic Minimum Degree (MD) Algorithm*

The minimum degree (MD) algorithm is the best-known and most widely used greedy heuristic for limiting fill-in. It seeks to find a permutation such that at each step of the factorization the number of entries in the corresponding column of *L* is minimized. This metric is easier and less expensive to compute compared to that used by the minimum fill-in criterion. If G*(A)* is a tree, then the MD algorithm results in no fill-in but, in most real applications, it does not minimize the amount of fill-in exactly.

The MD algorithm can be implemented using $\mathcal{G}(A)$ and it can predict the required factor storage without generating the structure of *L*. The basic approach is given in Algorithm 8.1. At step *k*, the number of off-diagonal nonzeros in a row or column of the active submatrix is the **current degree** of the corresponding vertex in the elimination graph $\mathcal{G}^k$. The algorithm selects a vertex of minimum current degree in $\mathcal{G}^k$ and labels it $v\_k$, i.e. next for elimination. The set of vertices adjacent to $v\_k$ in $\mathcal{G}(A)$ is $\mathcal{R}each(v\_k, \mathcal{V}\_k)$, where $\mathcal{V}\_k$ is the set of *k* − 1 vertices that have already been eliminated. These are the only vertices whose degrees can change at step *k*. If $u \in \mathcal{R}each(v\_k, \mathcal{V}\_k)$, $u \ne v\_k$, then its updated current degree is $|\mathcal{R}each(u, \mathcal{V}\_{k+1})|$, where $\mathcal{V}\_{k+1} = \mathcal{V}\_k \cup \{v\_k\}$.

At Step 4 of Algorithm 8.1, a tie-breaking strategy is needed when there is more than one vertex of current minimum degree. A straightforward strategy is to select the vertex that lies first in the original order. For the example in Figure 8.1, vertices 2, 3, and 6 are initially all of degree 2 and could be selected for elimination; as the lowest-numbered vertex, 2 is chosen. After it has been eliminated, vertices 3, 5, and 6 have current degree 2 and so vertex 3 is next. As all the remaining vertices have current degree 2, vertex 1 is eliminated, followed by 4, 5, and 6. It is possible to construct artificial matrices showing that some systematic tie-breaking choices can lead to a large amount of fill-in but such behaviour is not typical.

#### **ALGORITHM 8.1 Basic minimum degree (MD) algorithm**

**Input:** Graph G of a symmetrically structured matrix. **Output:** A permutation vector *p* that defines a new labelling of the vertices of G.

1: Set $\mathcal{G}^1 = \mathcal{G}$ and compute the degree $deg\_{\mathcal{G}^1}(u)$ of all $u \in \mathcal{V}(\mathcal{G}^1)$

2: **for** $k = 1 : n-1$ **do**

3: Compute $mdeg = \min\{deg\_{\mathcal{G}^k}(u) \mid u \in \mathcal{V}(\mathcal{G}^k)\}$ ▷ *mdeg* is the current minimum degree

4: Choose $v\_k \in \mathcal{V}(\mathcal{G}^k)$ such that $deg\_{\mathcal{G}^k}(v\_k) = mdeg$

5: $p(k) = v\_k$ ▷ $v\_k$ is the next vertex in the elimination order

6: Construct $\mathcal{G}^{k+1}$ and update the current degrees of its vertices

7: **end for**

8: $p(n) = v\_n$ ▷ where $v\_n$ is the only vertex in $\mathcal{G}^n$

**Figure 8.1** An illustration of three steps of the MD algorithm. The original graph G and the elimination graphs G<sup>2</sup>, G<sup>3</sup> and G<sup>4</sup> that result from eliminating vertex 2, then vertex 3 and then vertex 1 are shown. Red dashed lines denote fill edges.

The construction of each elimination graph <sup>G</sup>*k*+<sup>1</sup> is central to the implementation of the MD algorithm. Because eliminating a vertex potentially creates fill-in, an efficient representation of the resulting elimination graph that accommodates this (either implicitly or explicitly) is needed. Moreover, recalculating the current degrees is time consuming. Consequently, various approaches have been developed to enhance performance; these are discussed in the following subsections.
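The following is a minimal Python sketch of the basic MD algorithm that stores each elimination graph explicitly, which is exactly the approach that becomes impractical for large matrices. The function name `minimum_degree_order`, the dictionary-of-sets graph format and the two small test graphs are assumptions made for illustration; ties are broken by choosing the smallest vertex label.

```python
def minimum_degree_order(adj):
    """Return (elimination order, number of fill edges) for an undirected graph.

    adj maps each vertex to the set of its neighbours (illustrative format).
    """
    # Work on a copy that is overwritten by each successive elimination graph.
    g = {v: set(nbrs) for v, nbrs in adj.items()}
    order, fill = [], 0
    while g:
        # Minimum current degree, ties broken by the smallest label.
        v = min(g, key=lambda u: (len(g[u]), u))
        order.append(v)
        nbrs = g.pop(v)
        # The neighbours of v become a clique; missing edges are fill edges.
        for u in nbrs:
            g[u].discard(v)
            new_edges = nbrs - {u} - g[u]
            fill += len(new_edges)      # each fill edge is seen from both ends
            g[u] |= new_edges
    return order, fill // 2


# A small tree (no fill is possible) and a 4-cycle (one fill edge is needed).
tree = {1: {2, 3, 4}, 2: {1}, 3: {1}, 4: {1, 5}, 5: {4}}
cycle = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
print(minimum_degree_order(tree))
print(minimum_degree_order(cycle))
```

Recomputing the minimum over all vertices at every step is the simplest choice but not an efficient one; practical codes maintain degree lists and use the techniques described in the following subsections.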

#### *8.1.3 Use of Indistinguishable Vertices*

In Section 3.5.1, we introduced indistinguishable vertices and supervariables. The importance of exploiting these in MD algorithms is emphasized by the next two results. Here G*<sup>v</sup>* denotes the elimination graph obtained from G when vertex *v* ∈ V*(*G*)* is eliminated.

**Theorem 8.1 (George & Liu 1980b, 1989)** *Let u and w be indistinguishable vertices in* G*. If v* ∈ V(G) *with v* ≠ *u, w, then u and w are indistinguishable in* G<sub>*v*</sub>*.*

*Proof* Two cases must be considered. First, let *u* ∉ *adj*<sub>G</sub>{*v*}. Then *w* ∉ *adj*<sub>G</sub>{*v*} and if *v* is eliminated, the adjacency sets of *u* and *w* are unchanged and these vertices remain indistinguishable in the resulting elimination graph G<sub>*v*</sub>. Second, let *u, w* ∈ *adj*<sub>G</sub>{*v*}. When *v* is eliminated, because *u* and *w* are indistinguishable in G, their adjacency sets in G<sub>*v*</sub> are modified in the same way, by adding the entries of *adj*<sub>G</sub>{*v*} that are not already in *adj*<sub>G</sub>{*u*} and *adj*<sub>G</sub>{*w*}. Consequently, *u* and *w* are indistinguishable in G<sub>*v*</sub>.

Figure 8.2 demonstrates the two cases in the proof of Theorem 8.1. Here, *u* and *w* are indistinguishable vertices in G. Setting *v* ≡ *v*′ illustrates *u* ∉ *adj*<sub>G</sub>{*v*}. If *v*′ is eliminated, then the adjacency sets of *u* and *w* are clearly unchanged. Setting *v* ≡ *v*″ illustrates *u, w* ∈ *adj*<sub>G</sub>{*v*}. In this case, if *v*″ is eliminated, then vertices *s* and *t* are added to both *adj*<sub>G</sub>{*u*} and *adj*<sub>G</sub>{*w*}.

**Figure 8.2** An example to illustrate Theorem 8.1. *u* and *w* are indistinguishable vertices in G; *adj*<sub>G</sub>{*u*} = {*r, w, v*″} and *adj*<sub>G</sub>{*w*} = {*r, u, v*″}.

**Figure 8.3** An illustration of Theorem 8.2. Vertices *u* and *w* are of minimum degree (with degree *mdeg* = 3) and are indistinguishable in G. After elimination of *w*, the current degree of *u* is *mdeg* − 1 and the current degree of each of the other vertices is at least *mdeg* − 1. Therefore, *u* is of current minimum degree in G<sub>*w*</sub>. Note that vertices *r* and *v* are also of minimum degree and indistinguishable in G; they are not neighbours of *w* and their degrees do not change when *w* is eliminated.

**Theorem 8.2 (George & Liu 1980b, 1989)** *Let u and w be indistinguishable vertices in* G*. If w is of minimum degree in* G*, then u is of minimum degree in* G*w.*

*Proof* Let *deg*<sub>G</sub>(*w*) = *mdeg*. Then *deg*<sub>G</sub>(*u*) = *mdeg*. Indistinguishable vertices are always neighbours. Eliminating *w* gives *deg*<sub>G<sub>*w*</sub></sub>(*u*) = *mdeg* − 1 because *w* is removed from the adjacency set of *u* and there is no neighbour of *u* in G<sub>*w*</sub> that was not its neighbour in G. If *x* ≠ *w* and *x* ∈ *adj*<sub>G</sub>{*u*}, then the number of neighbours of *x* in G<sub>*w*</sub> is at least *mdeg* − 1. Otherwise, if *x* ∉ *adj*<sub>G</sub>{*u*}, then its adjacency set in G<sub>*w*</sub> is the same as in G and has size at least *mdeg*. The result follows.

Theorem 8.2 is illustrated in Figure 8.3.

Theorems 8.1 and 8.2 can be extended to more than two indistinguishable vertices, which allows indistinguishable vertices to be selected one after another in the MD ordering. This is referred to as **mass elimination**. Treating indistinguishable vertices as a single supervariable cuts the number of vertices and edges in the elimination graphs, which reduces the work needed for degree updates.

In the basic MD algorithm, the current degree of a vertex is the number of adjacent vertices in the current elimination graph. The **external degree** of a vertex is the number of vertices adjacent to it that are not indistinguishable from it. The motivation comes from the underlying reason for the success of the minimum degree ordering in terms of fill reduction. Eliminating a vertex of minimum degree forms the smallest possible clique that can result from an elimination. If mass elimination is used, then the size of the resulting clique is equal to the external degree of the vertices eliminated by the mass elimination step. Using the external degree can reduce the time needed to compute the ordering and give worthwhile savings in the number of entries in the factors.
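To make the definitions concrete, the sketch below groups indistinguishable vertices by comparing closed neighbourhoods *adj*{*v*} ∪ {*v*} and computes the external degree. The helper names `supervariables` and `external_degree` are hypothetical, and the graph is only loosely modelled on the adjacency sets quoted in the caption of Figure 8.2; the vertices `s` and `t` and the remaining edges are assumptions.

```python
from collections import defaultdict


def supervariables(adj):
    """Group vertices whose closed neighbourhoods adj{v} | {v} coincide."""
    groups = defaultdict(list)
    for v, nbrs in adj.items():
        groups[frozenset(nbrs | {v})].append(v)
    return sorted(sorted(group) for group in groups.values())


def external_degree(adj, v):
    """Count the neighbours of v that are not indistinguishable from v."""
    closed_v = adj[v] | {v}
    return sum(1 for u in adj[v] if adj[u] | {u} != closed_v)


# Assumed example graph: u and w are adjacent and share the neighbours r and v.
graph = {
    "r": {"u", "w"},
    "u": {"r", "w", "v"},
    "w": {"r", "u", "v"},
    "v": {"u", "w", "s", "t"},
    "s": {"v"},
    "t": {"v"},
}
print(supervariables(graph))
print(len(graph["u"]), external_degree(graph, "u"))
```

Here `u` has degree 3 but external degree 2, because its neighbour `w` belongs to the same supervariable.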

#### *8.1.4 Degree Outmatching*

A concept that is closely related to that of indistinguishable vertices is **degree outmatching**. This avoids computing the degrees of vertices that are known not to be of current minimum degree. Vertex *w* is said to be outmatched by vertex *u* if

$$adj_{\mathcal{G}}\{u\} \cup \{u\} \subseteq adj_{\mathcal{G}}\{w\} \cup \{w\}.$$

It follows that *deg*G*(u)* ≤ *deg*G*(w).* A simple example is given in Figure 8.4. Importantly, degree outmatching is preserved when vertex *v* ∈ G of minimum degree is eliminated, as stated in the following result.

**Theorem 8.3 (George & Liu 1980b, 1989)** *In the graph* G *let vertex w be outmatched by vertex u and vertex v (v* ≠ *u, w) be of minimum degree. Then w is outmatched in* G<sub>*v*</sub> *by u.*

*Proof* Three cases must be considered. First, if *u* ∉ *adj*<sub>G</sub>{*v*} and *w* ∉ *adj*<sub>G</sub>{*v*}, then the adjacency sets of *u* and *w* in G<sub>*v*</sub> are the same as in G. Second, if *v* is a neighbour of both *u* and *w* in G, then any neighbours of *v* that were not neighbours of *u* and *w* are added to their adjacency sets in G<sub>*v*</sub>. Third, if *u* ∉ *adj*<sub>G</sub>{*v*} and *w* ∈ *adj*<sub>G</sub>{*v*}, then the adjacency set of *u* in G<sub>*v*</sub> is the same as in G but any neighbours of *v* that were not neighbours of *w* are added to the adjacency set of *w* in G<sub>*v*</sub>. In all three cases, *w* is still outmatched by *u* in G<sub>*v*</sub>.

**Figure 8.4** An example G in which vertex *w* is outmatched by vertex *u*. *v*′ is not a neighbour of *u* or *w*; vertex *v*″ is a neighbour of both *u* and *w*; *v*‴ is a neighbour of *w* but not of *u*.

The three possible cases for *v* in the proof of Theorem 8.3 are illustrated in Figure 8.4 by setting *v* ≡ *v*′, *v*″ and *v*‴, respectively. An important consequence of Theorem 8.3 is that if *w* is outmatched by *u*, then it is not necessary to consider *w* as a candidate for elimination and all updates to the data structures related to *w* can be postponed until *u* has been eliminated.
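The outmatching condition and its preservation under elimination can be checked on a small example. In this sketch, the function names `is_outmatched` and `eliminate` and the six-vertex graph are assumptions chosen for illustration; in the graph, `v` has minimum degree and is a neighbour of `w` but not of `u`, which corresponds to the third case of the proof.

```python
def is_outmatched(adj, w, u):
    """True if w is outmatched by u, i.e. adj{u} | {u} is a subset of adj{w} | {w}."""
    return (adj[u] | {u}) <= (adj[w] | {w})


def eliminate(adj, v):
    """Return the elimination graph obtained by eliminating vertex v."""
    nbrs = adj[v]
    g = {x: set(n) - {v} for x, n in adj.items() if x != v}
    for x in nbrs:
        g[x] |= nbrs - {x}          # the neighbours of v become a clique
    return g


# Assumed example: closed neighbourhood of u is {u, w, a}, contained in that of w.
graph = {
    "u": {"w", "a"},
    "w": {"u", "a", "v"},
    "a": {"u", "w", "y"},
    "v": {"w", "x"},
    "x": {"v", "y"},
    "y": {"x", "a"},
}
print(is_outmatched(graph, "w", "u"))
print(is_outmatched(eliminate(graph, "v"), "w", "u"))
```

Because `w` remains outmatched after each elimination of a minimum degree vertex other than `u` and `w`, an implementation can leave the degree of `w` stale until `u` itself is eliminated.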

#### *8.1.5 Cliques and Quotient Graphs*

From Parter's rule, if vertex *v* is selected at step *k*, then the elimination matrix that corresponds to G<sup>*k*+1</sup> contains a dense submatrix of size equal to the number of off-diagonal entries in row and column *v* in the matrix that corresponds to G<sup>*k*</sup>. For large matrices, creating and explicitly storing the edges in the sequence of elimination graphs is impractical and a more compact and efficient representation is needed. Each elimination graph can be interpreted as a collection of cliques, including the original graph G, which can be regarded as having |E| cliques, each consisting of two vertices (or, equivalently, an edge). This gives a conceptually different view of the elimination process and provides a compact scheme to represent the elimination graphs. The advantage in terms of storage is based on the following.

Let {V<sub>1</sub>, V<sub>2</sub>, ..., V<sub>*q*</sub>} be the set of cliques for the current graph and let *v* be a vertex of current minimum degree that is selected for elimination. Let {V<sub>*s*<sub>1</sub></sub>, V<sub>*s*<sub>2</sub></sub>, ..., V<sub>*s*<sub>*t*</sub></sub>} be the subset of cliques to which *v* belongs. Two steps are then required.

1. Remove the cliques {V<sub>*s*<sub>1</sub></sub>, V<sub>*s*<sub>2</sub></sub>, ..., V<sub>*s*<sub>*t*</sub></sub>} from {V<sub>1</sub>, V<sub>2</sub>, ..., V<sub>*q*</sub>}.

2. Add the new clique V<sub>*v*</sub> = (V<sub>*s*<sub>1</sub></sub> ∪ ... ∪ V<sub>*s*<sub>*t*</sub></sub>) \ {*v*} into the set of cliques.

Hence

$$|\mathcal{V}_v| < \sum_{i=1}^{t} |\mathcal{V}_{s_i}|,$$

and because {V<sub>*s*<sub>1</sub></sub>, V<sub>*s*<sub>2</sub></sub>, ..., V<sub>*s*<sub>*t*</sub></sub>} can now be discarded, the storage required for the representation of the sequence of elimination graphs never exceeds that needed for G(*A*). The storage to compute an MD ordering is therefore known beforehand in spite of the rather dynamic nature of the elimination process. The index of the eliminated vertex can be used as the index of the new clique. This is called an **element** or **enode** (the terminology comes from finite-element methods), to distinguish it from an uneliminated vertex, which is termed an **snode**.
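The two clique-update steps can be sketched as follows. The function names `eliminate_in_cliques` and `storage` and the path-graph example are assumptions for illustration; the graph is held simply as a list of vertex sets.

```python
def eliminate_in_cliques(cliques, v):
    """Eliminate v from a graph stored as a list of cliques (sets of vertices)."""
    touched = [c for c in cliques if v in c]      # cliques to which v belongs
    kept = [c for c in cliques if v not in c]     # step 1: remove the touched cliques
    new_clique = set().union(*touched) - {v}      # step 2: merge them, dropping v
    if new_clique:
        kept.append(new_clique)
    return kept


def storage(cliques):
    """Total number of vertex indices needed to store all the cliques."""
    return sum(len(c) for c in cliques)


# Path graph 1 - 2 - 3 - 4 held as one two-vertex clique per edge.
g1 = [{1, 2}, {2, 3}, {3, 4}]
g2 = eliminate_in_cliques(g1, 2)
g3 = eliminate_in_cliques(g2, 3)
print(g2, g3)
print(storage(g1), storage(g2), storage(g3))
```

The storage count decreases at every step in this example, in line with the inequality above.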

A sequence of special quotient graphs G<sup>[1]</sup> = G(*A*), G<sup>[2]</sup>, ..., G<sup>[*n*]</sup> with the two types of vertices is generated in place of the elimination graphs. Each G<sup>[*k*]</sup> has *n* vertices that satisfy

$$\mathcal{V}(\mathcal{G}) = \mathcal{V}_{snodes} \cup \mathcal{V}_{enodes}, \qquad \mathcal{V}_{snodes} \cap \mathcal{V}_{enodes} = \emptyset,$$

where V*snodes* and V*enodes* are the sets of snodes and enodes, respectively. When *v* is eliminated, any enodes adjacent to it are no longer required to represent the sparsity pattern of the corresponding active submatrix and so they can be removed. This is called **element absorption**.

Working with these graphs can be demonstrated by considering the computation of the vertex degrees. To compute the degree of an uneliminated vertex, the set of neighbouring snodes is counted. Then, if a neighbour of one of these snodes is an enode, its neighbours are also counted (avoiding double counting). More formally, if *v* ∈ V*snodes*, then the adjacency set of *v* is the union of its neighbours in V*snodes* and the vertices reachable from *v* via its neighbours in V*enodes*. In this way, vertex degrees are computed by considering fill-paths, avoiding storing the fill-in entries explicitly. This reduces memory requirements and contributes to the computational efficiency, which can be further improved by amalgamating sets of indistinguishable enodes and snodes.
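The degree computation via enodes can be sketched as below. The three dictionaries (`snode_adj`, `enode_adj`, `elements`), the function names and the five-vertex graph are all assumptions made for illustration; the example is not the graph of Figure 8.5, although it behaves similarly (eliminating vertex 1 creates one implicit fill edge and eliminating vertex 2 creates none).

```python
def init_quotient(adj):
    """Start from G(A): every vertex is an snode and there are no enodes."""
    snode_adj = {v: set(n) for v, n in adj.items()}
    enode_adj = {v: set() for v in adj}
    return snode_adj, enode_adj, {}


def quotient_degree(v, snode_adj, enode_adj, elements):
    """Degree of snode v: snode neighbours plus snodes reachable via adjacent enodes."""
    reach = set(snode_adj[v])
    for e in enode_adj[v]:
        reach |= elements[e]
    reach.discard(v)
    return len(reach)


def eliminate(v, snode_adj, enode_adj, elements):
    """Turn snode v into an enode, absorbing the enodes adjacent to it."""
    new_elem = set(snode_adj[v])
    for e in enode_adj[v]:
        new_elem |= elements.pop(e)          # element absorption
    new_elem.discard(v)
    absorbed = enode_adj.pop(v)
    del snode_adj[v]
    for u in new_elem:
        snode_adj[u].discard(v)
        snode_adj[u] -= new_elem             # these edges are now implicit in the element
        enode_adj[u] -= absorbed
        enode_adj[u].add(v)
    elements[v] = new_elem


graph = {1: {2, 3}, 2: {1, 4}, 3: {1, 4, 5}, 4: {2, 3, 5}, 5: {3, 4}}
state = init_quotient(graph)
eliminate(1, *state)
after_first = {v: quotient_degree(v, *state) for v in sorted(state[0])}
eliminate(2, *state)
after_second = {v: quotient_degree(v, *state) for v in sorted(state[0])}
print(after_first, after_second)
```

No fill edge is ever stored: the fill edge between vertices 2 and 3 is represented only by their common membership of element 1.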

The sequences of elimination graphs and quotient graphs are illustrated in Figure 8.5. The top line shows G together with G<sup>2</sup> and G<sup>3</sup> after the elimination of vertices 1 and 2, respectively. When vertex 1 is eliminated, a new edge is added to make its neighbours into a clique. The elimination of vertex 2 creates no additional fill and the graph G<sup>3</sup> with three nodes represents the sparsity structure of the corresponding active submatrix *A*<sup>(3)</sup>. The bottom line shows the corresponding quotient graphs. After the first elimination, vertex 1 is an enode and the fill edge is represented implicitly. After the second elimination, the enodes 1 and 2 can be amalgamated and so too can the snodes 3 and 4 because they are indistinguishable.

**Figure 8.5** The top line shows G = G<sup>1</sup>, G<sup>2</sup> and G<sup>3</sup>. The red dashed line denotes a fill edge. The bottom line shows the quotient graphs G<sup>[2]</sup> and G<sup>[3]</sup> after the first and second elimination steps. A circle represents a vertex in G (an snode), while a square represents an enode.

#### **ALGORITHM 8.2 Basic multiple minimum degree (MMD) algorithm**

**Input:** Graph G of a symmetrically structured matrix.

**Output:** A permutation vector *p* that defines a new labelling of the vertices of G.

1: Set *k* = 1, G<sup>1</sup> = G and compute the degree *deg*<sub>G<sup>1</sup></sub>(*u*) of all *u* ∈ V(G<sup>1</sup>)

2: **while** *k* ≤ *n* **do**

3: Compute *mdeg* = min{*deg*<sub>G<sup>*k*</sup></sub>(*u*) | *u* ∈ V(G<sup>*k*</sup>)}

4: Find all mutually non-adjacent *v<sub>j</sub>* ∈ V(G<sup>*k*</sup>), *j* = 1, ..., *t*, with *deg*<sub>G<sup>*k*</sup></sub>(*v<sub>j</sub>*) = *mdeg*

5: **for** *j* = 1 : *t* **do**

6: *p*(*k*) = *v<sub>j</sub>* ▷ Vertex *v<sub>j</sub>* is the next vertex in the elimination order

7: *k* = *k* + 1

8: **end for**

9: **if** *k* < *n* **then**

10: Construct G<sup>*k*+1</sup> and update the current degrees of its vertices

11: **end if**

12: **end while**

#### *8.1.6 Multiple Minimum Degree (MMD) Algorithm*

The multiple minimum degree (MMD) algorithm aims to improve efficiency by processing several independent vertices that are each of minimum current degree together in the same step, before the degree updates are performed. The basic approach is outlined as Algorithm 8.2. At each outer loop, *t* ≥ 1 denotes the number of vertices of minimum current degree that are mutually non-adjacent and so can be put into the elimination order one after another. An example in which the four corner vertices have the same minimum degree is depicted in Figure 8.6. Here, on the first step, *mdeg* = 2 and *t* = 4. Note that the MMD strategy is complementary to the mass elimination approach in which the set *S* of indistinguishable vertices that can be eliminated one after another is fully interconnected and all vertices of *S* have the same set of neighbours outside *S*.

**Figure 8.6** The red (corner) vertices of G are each of degree 2 and are ordered consecutively during the first step of Algorithm 8.2.

The complexity of the MD and MMD algorithms is *O*(*nz*(*A*)*n*<sup>2</sup>), but because the MMD algorithm performs the degree update in its outer loop fewer times, it can be significantly faster than MD. MMD orderings can also lead to less fill-in, possibly as a consequence of introducing some kind of regularity into the ordering sequence.
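The selection of mutually non-adjacent vertices of minimum degree (Step 4 of Algorithm 8.2) can be sketched with a greedy scan. This is one possible reading of "find all", giving a maximal rather than a maximum independent set; the function name and the 3 × 3 grid graph, in which the four corner vertices have degree 2, are assumptions for illustration.

```python
def independent_min_degree_set(adj):
    """Greedily pick mutually non-adjacent vertices of minimum degree."""
    mdeg = min(len(nbrs) for nbrs in adj.values())
    chosen = []
    for v in sorted(adj):
        # Accept v only if it is not adjacent to a vertex already chosen.
        if len(adj[v]) == mdeg and not (adj[v] & set(chosen)):
            chosen.append(v)
    return mdeg, chosen


# Assumed example: 3 x 3 grid graph with vertices (i, j) and 4-point connectivity.
n = 3
steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
grid = {
    (i, j): {(i + di, j + dj) for di, dj in steps
             if 0 <= i + di < n and 0 <= j + dj < n}
    for i in range(n) for j in range(n)
}
mdeg, corners = independent_min_degree_set(grid)
print(mdeg, corners)
```

All four corners are placed in the elimination order before any degree update is performed, so the update cost is paid once rather than four times.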

#### *8.1.7 Approximate Minimum Degree (AMD) Algorithm*

The idea behind the widely used **approximate minimum degree** (AMD) algorithm is to inexpensively compute an upper bound on a vertex degree in place of the degree, and to use this bound as an approximation to the external degree when selecting vertices within the MD algorithm. Even though vertex degrees are not determined exactly, the quality of the orderings obtained using the AMD algorithm is competitive with that of orderings computed using the MD algorithm and can surpass it. The complexity of AMD is *O*(*nz*(*A*)*n*) and, in practice, its runtime is typically significantly less than that of the MD and MMD approaches.

#### **8.2 Minimizing the Bandwidth and Profile**

An alternative way of reducing the fill-in locally is to add another criterion to the relabelling of the vertices, such as restricting the nonzeros of the permuted matrix to specific positions. The most popular approach is to force them to lie close to the main diagonal. If Gaussian elimination is applied without further permutations, then all fill-in takes place between the first entry of a row and the diagonal or between the first entry of a column and the diagonal. It is therefore sufficient to store all the entries in the lower triangular part from the first entry in each row to the diagonal and all the entries in the upper triangular part from the first entry in each column to the diagonal. This allows straightforward implementations of Gaussian elimination that employ static data structures. Here we again consider symmetric S{*A*} and, for simplicity, we assume that G(*A*) is connected; generalizations of the terminology and ideas to nonsymmetric matrices are possible.

#### *8.2.1 The Band and Envelope of a Matrix*

To characterize the positions within S{*A*} that are close to the main diagonal, we denote the leftmost entries in the lower triangular part of *A* using the mapping *η<sub>i</sub>* as follows:

$$\eta_i(A) = \min\{j \mid 1 \le j \le i \text{ with } a_{ij} \ne 0\}, \quad 1 \le i \le n,\tag{8.1}$$

that is, *η<sub>i</sub>*(*A*) is the column index of the first entry in the *i*-th row of *A*.

Define

$$
\beta_i(A) = i - \eta_i(A), \quad 1 \le i \le n.
$$

The **semibandwidth** of *A* is

$$\max \{ \beta\_i(A) \mid 1 \le i \le n \},$$

and the **bandwidth** is

$$\beta(A) = 2 \max \{ \beta_i(A) \mid 1 \le i \le n \} + 1.$$

The **band** of *A* is the following set of index pairs in *A*

$$band(A) = \{(i, j) \mid 0 < i - j \le \max \{ \beta_k(A) \mid 1 \le k \le n \}\}.$$

The **envelope** is the set of index pairs that lie between the first entry in each row and the diagonal

$$env(A) = \{(i, j) \mid 0 < i - j \le \beta_i(A)\}.$$

Note that the band and envelope of a sparse symmetrically structured matrix *A* include only entries of the strict lower triangular part of *A*. The envelope is easily visualized: picture the lower triangular part of *A*, and remove the diagonal and the leading zero entries in each row. The remaining entries (whether nonzero or zero) comprise the envelope of *A*. The **profile** of *A* is defined to be the number of entries in the envelope (the envelope size) plus *n*. <sup>1</sup> An illustrative example is given in Figure 8.7. Here *η*1*(A)* = 1, *β*1*(A)* = 0, *η*2*(A)* = 1, *β*2*(A)* = 1, *η*3*(A)* = 2, *β*3*(A)* = 1, and so on.
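The quantities defined above are simple to compute from the positions of the off-diagonal entries in the lower triangle. In the sketch below, the function name `envelope_stats` and the 7 × 7 pattern are assumptions: the pattern was chosen to reproduce the values quoted for Figure 8.7 (*β*<sub>1</sub> = 0, *β*<sub>2</sub> = 1, *β*<sub>3</sub> = 1, bandwidth 5, envelope size 7, profile 14) and is not necessarily the pattern shown in the figure. The diagonal is assumed to be nonzero.

```python
def envelope_stats(n, lower_entries):
    """Band and envelope statistics from (i, j) positions with i > j, 1-based."""
    eta = list(range(n + 1))        # eta[i] = i: the diagonal entry (index 0 unused)
    for i, j in lower_entries:
        if j < i:
            eta[i] = min(eta[i], j)
    beta = [i - eta[i] for i in range(1, n + 1)]
    semibandwidth = max(beta)
    envelope_size = sum(beta)
    return {
        "beta": beta,
        "semibandwidth": semibandwidth,
        "bandwidth": 2 * semibandwidth + 1,
        "envelope_size": envelope_size,
        "profile": envelope_size + n,
    }


# Assumed pattern consistent with the numbers quoted for Figure 8.7.
pattern = [(2, 1), (3, 2), (4, 2), (5, 4), (6, 5), (7, 6)]
stats = envelope_stats(7, pattern)
print(stats)
```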


**Figure 8.7** Illustration of the band and envelope of a matrix *A* whose sparsity pattern is on the left. In the centre, the positions of *band(A)* are circled and on the right, the positions of *env(A)* are circled. The bandwidth is 5 and the envelope size and the profile are 7 and 14, respectively.

<sup>1</sup> Sometimes in the literature the profile is defined to be the envelope size.

The next result shows that the static data structures determined for *A* are sufficient for its Cholesky factors and by permuting *A* to minimize its band or profile, the fill-in is also approximately minimized.

**Theorem 8.4 (Liu & Sherman 1976; George & Liu 1981)** *If L is the Cholesky factor of A, then*

$$\operatorname{env}(A) = \operatorname{env}(L).$$

*Proof* The proof uses mathematical induction on the leading principal submatrices of *A* of order *k*. The result is clearly true for *k* = 1 and *k* = 2. Assume it holds for 2 ≤ *k* < *n* and consider the block factorization

$$
\begin{pmatrix}
A_{1:k,1:k} & u_{1:k} \\
u_{1:k}^T & \alpha
\end{pmatrix} = \begin{pmatrix}
L_{1:k,1:k} & 0 \\
v_{1:k}^T & \beta
\end{pmatrix} \begin{pmatrix}
L_{1:k,1:k}^T & v_{1:k} \\
0 & \beta
\end{pmatrix},
$$

where *α* and *β* are scalars. Equating the left and right sides, *L*<sub>1:*k*,1:*k*</sub>*v*<sub>1:*k*</sub> = *u*<sub>1:*k*</sub>. Because *u<sub>j</sub>* = 0 for *j* < *η*<sub>*k*+1</sub>(*A*) and *u*<sub>*η*<sub>*k*+1</sub></sub> ≠ 0, forward substitution with the lower triangular matrix *L*<sub>1:*k*,1:*k*</sub> gives *v<sub>j</sub>* = 0 for *j* < *η*<sub>*k*+1</sub>(*A*) and *v*<sub>*η*<sub>*k*+1</sub></sub> ≠ 0. This proves the induction step.

A straightforward corollary of Theorem 8.4 is that *band(A)* = *band(L).*

#### *8.2.2 Level-Based Orderings*

Finding a permutation *P* to minimize the band or profile of *PAP<sup>T</sup>* is combinatorially hard and again heuristics are used to efficiently find an acceptable *P*. The popular Cuthill-McKee (CM) approach chooses a suitable starting vertex *s* and labels it 1. Then, for *i* = 1, 2, ..., *n* − 1, all vertices adjacent to vertex *i* that are still unlabelled are labelled successively in order of increasing degree, as described in Algorithm 8.3. A very important variation is the Reverse Cuthill-McKee (RCM) algorithm, which incorporates a final step in which the CM ordering is reversed. The CM- and RCM-permuted matrices have the same bandwidth but the latter can decrease the envelope, as demonstrated in Figure 8.8.

The importance of the CM and RCM orderings is expressed in the following theorem. The full envelope of the Cholesky factor of the permuted matrix implies cache efficiency when performing the triangular solves once the factorization is complete.

**Theorem 8.5 (Liu & Sherman 1976; George & Liu 1981)** *Let A be symmetrically structured and irreducible. If P corresponds to the CM labelling obtained from Algorithm 8.3 and L is the Cholesky factor of P<sup>T</sup> AP, then env(L) is full, that is, all entries of the envelope are nonzero.*


**Figure 8.8** An example to illustrate Algorithm 8.3. The starting vertex is *s* = 3; it has degree 1. The graph G*(A)* is given and the sparsity patterns of *A* (left), *A* symmetrically permuted by the CM algorithm (centre) and *A* symmetrically permuted by the RCM algorithm (right). The profiles of these matrices are 25, 17, and 16, respectively.

A crucial difference between profile reduction ordering algorithms and minimum degree strategies is that the former is based solely on G: the costly construction of quotient graphs is not needed. However, unless the profile after reordering is very small, there can be significantly more fill-in in the factor.

Key to the success of Algorithm 8.3 is the choice of the starting vertex *s*: the quality of the ordering is highly dependent on *s*. A good candidate is a vertex for which the maximum distance between it and some other vertex in G is large. Formally, the **eccentricity** *ε*(*u*) of the vertex *u* in the connected undirected graph G is defined to be

$$\epsilon(u) = \max \{ d(u, v) \mid v \in \mathcal{V} \},$$

where *d*(*u, v*) is the distance between the vertices *u* and *v* (the length of the shortest path between these vertices). The maximum eccentricity taken over all the vertices is the **diameter** of G (that is, the maximum distance between any pair of vertices). The endpoints of a diameter (also termed **peripheral vertices**) provide good starting vertices. The complexity of finding a diameter is *O*(*n*<sup>3</sup>) because the shortest paths amongst all the vertices have to be checked. Thus, a pseudo-diameter defined by any pair of vertices for which *d*(*u, v*) is close to the diameter is used instead. The vertices defining a pseudo-diameter are **pseudo-peripheral** vertices.

#### **ALGORITHM 8.3 CM and RCM algorithms for band and profile reduction**

**Input:** Graph G of a symmetrically structured irreducible matrix and a starting vertex *s*.

**Output:** Permutation vectors *p<sub>cm</sub>* and *p<sub>rcm</sub>* that define new labellings of the vertices of G(*A*).

1: *label*(1 : *n*) = *false*

2: Compute *adj*<sub>G</sub>{*u*} and *deg*<sub>G</sub>(*u*) for all *u* ∈ V(G)

3: *k* = 1, *v*<sub>1</sub> = *s*, *p<sub>cm</sub>*(1) = *v*<sub>1</sub>, *label*(*v*<sub>1</sub>) = *true*

4: **for** *i* = 1 : *n* − 1 **do**

5: **for** *w* ∈ *adj*<sub>G</sub>{*v<sub>i</sub>*} with *label*(*w*) = *false* in order of increasing degree **do**

6: *k* = *k* + 1, *v<sub>k</sub>* = *w*, *p<sub>cm</sub>*(*k*) = *v<sub>k</sub>*, *label*(*v<sub>k</sub>*) = *true*

7: **end for**

8: **end for**

9: For the RCM ordering, *p<sub>rcm</sub>*(*i*) = *p<sub>cm</sub>*(*n* − *i* + 1), *i* = 1, 2, ..., *n*.
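A direct transcription of Algorithm 8.3 into Python might look as follows. The function names, the tie-breaking by vertex label and the badly labelled path graph used as the example are assumptions for illustration; the graph is assumed to be connected and the starting vertex is simply supplied by the caller.

```python
def cuthill_mckee(adj, s):
    """Return the CM ordering and the RCM ordering of a connected graph."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    order = [s]
    labelled = {s}
    i = 0
    while i < len(order):
        v = order[i]
        # Label the unlabelled neighbours of v in order of increasing degree.
        for w in sorted(adj[v] - labelled, key=lambda x: (degree[x], x)):
            labelled.add(w)
            order.append(w)
        i += 1
    return order, order[::-1]


def semibandwidth(adj, order):
    """Largest distance from the diagonal of any entry after relabelling."""
    pos = {v: k for k, v in enumerate(order)}
    return max(abs(pos[u] - pos[v]) for u in adj for v in adj[u])


# Assumed example: the path 1 - 4 - 2 - 5 - 3 with an unhelpful original labelling.
path = {1: {4}, 4: {1, 2}, 2: {4, 5}, 5: {2, 3}, 3: {5}}
cm, rcm = cuthill_mckee(path, 1)
print(cm, rcm)
print(semibandwidth(path, [1, 2, 3, 4, 5]), semibandwidth(path, cm))
```

Starting from the end vertex 1, the relabelling follows the path and the semibandwidth drops from 3 to 1.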

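A heuristic algorithm is used to find pseudo-peripheral vertices. A commonly used approach is based on level sets. A level structure rooted at a vertex *r* is defined as the partitioning of V into disjoint **levels** L<sub>1</sub>(*r*), L<sub>2</sub>(*r*), ..., L<sub>*h*</sub>(*r*) such that

$$\mathcal{L}_1(r) = \{r\}, \qquad \mathcal{L}_i(r) = adj_{\mathcal{G}}\{\mathcal{L}_{i-1}(r)\} \setminus \bigcup_{j=1}^{i-1} \mathcal{L}_j(r), \quad 1 < i \le h.$$
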
The level structure rooted at *r* may be expressed as the set L(*r*) = {L<sub>1</sub>(*r*), L<sub>2</sub>(*r*), ..., L<sub>*h*</sub>(*r*)}, where *h* is the total number of levels and is termed the **depth**. The level sets can be found using a breadth-first search that starts at the root *r*. The Gibbs-Poole-Stockmeyer (GPS) algorithm presented as Algorithm 8.4 can be used to find pseudo-peripheral vertices, one of which may then be used as a starting vertex for the CM and RCM algorithms. Here the root vertex *r* is normally taken to be an arbitrary vertex of minimum degree. L(*r*) is constructed first, followed by the level structures rooted at each of the vertices in the last level set L<sub>*h*</sub>(*r*). If, for some *w* ∈ L<sub>*h*</sub>(*r*), the depth of L(*w*) exceeds that of L(*r*), then *w* replaces *r* as the root vertex and the procedure is repeated. If no such vertex is found, *r* is chosen as a pseudo-peripheral vertex.

A simple example is given in Figure 8.9. Starting with *r* = 2, after two passes through the while loop, the GPS algorithm returns *s* = 8 and *t* = 1 as pseudo-peripheral vertices.

#### **ALGORITHM 8.4 Basic GPS algorithm to find a pair of pseudo-peripheral vertices**

**Input:** Graph G of a symmetrically structured irreducible matrix and a root vertex *r*.

**Output:** Pseudo-peripheral vertices *s, t*.

1: Construct L(*r*) and set *flag* = *false*

2: **while** *flag* = *false* **do**

3: *flag* = *true*

4: **for** *i* = 1 : |L<sub>*h*</sub>(*r*)| **do**

5: *w<sub>i</sub>* ∈ L<sub>*h*</sub>(*r*) ▷ Select vertex *w<sub>i</sub>* from last level set

6: **if** *flag* = *true* **then**

7: Construct L(*w<sub>i</sub>*)

8: **if** *depth*(L(*w<sub>i</sub>*)) > *depth*(L(*r*)) **then**

9: *flag* = *false* ▷ Flag that *w<sub>i</sub>* will be used as new initial vertex

10: **end if**

11: **end if**

12: **end for**

13: **if** *flag* = *true* **then**

14: *s* = *r* and *t* = *w<sub>i</sub>* ▷ *s* is chosen; while loop will terminate algorithm

15: **else**

16: *r* = *w<sub>i</sub>*

17: **end if**

18: **end while**

**Figure 8.9** An example to illustrate Algorithm 8.4 for finding pseudo-peripheral vertices. With root vertex *r* = 2, the first level set structure is L(2) = {{2}, {1, 3}, {4, 5, 7}, {6, 8}}. Setting *r* = 8 at Step 16, the second level set structure is L(8) = {{8}, {4, 7}, {3, 6}, {2, 5}, {1}} and the algorithm terminates with *s* = 8 and *t* = 1.

To obtain an efficient implementation of the GPS algorithm, it is necessary to limit the number of level set structures that are fully constructed. For example, "short circuiting" can be incorporated in which wide level structures are rejected as soon as they are detected (wide levels will not lead to a deep level structure, which is needed for a narrow band). Furthermore, to reduce the number of vertices in the last level set L<sub>*h*</sub>(*r*) for which it is necessary to generate the rooted level structures, a "shrinking" strategy can be used. This typically involves considering the degrees of the vertices in L<sub>*h*</sub>(*r*) (for example, only those of smallest degree will be tried). Such modifications can lead to significant time savings while still returning a good starting vertex for the CM and RCM algorithms. As with the MD algorithm, tie-breaking rules must be built into any implementation.
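The level-set construction and the search for a pseudo-peripheral pair can be sketched as follows. The function names are hypothetical, and the eight-vertex edge set is an assumption: it was chosen so that its rooted level structures match those quoted in the caption of Figure 8.9, but it need not be the graph drawn there. The candidates in the last level set are tried in order of increasing degree, and *t* is taken to be the first of them; both choices are assumptions about tie-breaking.

```python
def level_structure(adj, r):
    """Breadth-first level sets L_1(r), ..., L_h(r), each as a sorted list."""
    levels = [[r]]
    seen = {r}
    while True:
        nxt = sorted({w for v in levels[-1] for w in adj[v]} - seen)
        if not nxt:
            return levels
        seen.update(nxt)
        levels.append(nxt)


def pseudo_peripheral_pair(adj, r):
    """Return (s, t): restart from any last-level vertex giving a deeper structure."""
    levels = level_structure(adj, r)
    while True:
        candidates = sorted(levels[-1], key=lambda v: (len(adj[v]), v))
        for w in candidates:
            w_levels = level_structure(adj, w)
            if len(w_levels) > len(levels):
                r, levels = w, w_levels      # w becomes the new root
                break
        else:
            return r, candidates[0]          # no deeper structure was found


# Assumed edge set consistent with the level structures quoted for Figure 8.9.
graph = {1: {2, 5}, 2: {1, 3}, 3: {2, 4, 7}, 4: {3, 6, 8},
         5: {1, 6}, 6: {4, 5}, 7: {3, 8}, 8: {4, 7}}
print(level_structure(graph, 2))
print(pseudo_peripheral_pair(graph, 2))
```

This version constructs every candidate level structure in full; the short circuiting and shrinking strategies described above would reduce that cost.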

#### *8.2.3 Spectral Orderings*

Spectral methods offer an alternative approach that does not use level structures. The spectral algorithm associates a positive semidefinite Laplacian matrix *L<sub>p</sub>* with the symmetric matrix *A* as follows:

$$(L\_p)\_{ij} = \begin{cases} -1 & \text{if } i \neq j \text{ and } a\_{ij} \neq 0, \\ \deg\_{\mathcal{G}}(i) & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$

An eigenvector corresponding to the smallest positive eigenvalue of the Laplacian matrix is called a **Fiedler vector.** If G is connected, *Lp* is irreducible and the second smallest eigenvalue is positive. The vertices of G are ordered by sorting the entries of the Fiedler vector into monotonic order. Applying the permutation symmetrically to *A* yields the spectral ordering.
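For very small graphs, a Fiedler vector can be approximated without any library by power iteration on the shifted matrix *σI* − *L<sub>p</sub>*, with the constant vector projected out at every step. This is only a sketch of the idea and is not how production codes compute the Fiedler vector (they typically use Lanczos-type or multilevel methods). The function name, the shift *σ* = 2 max *deg*, the fixed iteration count and the relabelled path graph are all assumptions.

```python
def spectral_order(adj, iters=2000):
    """Order vertices by an approximate Fiedler vector (power iteration sketch)."""
    verts = sorted(adj)
    n = len(verts)
    idx = {v: k for k, v in enumerate(verts)}
    # By Gershgorin's theorem, 2 * max degree bounds the largest eigenvalue of L_p.
    shift = 2.0 * max(len(adj[v]) for v in verts)
    x = [k - (n - 1) / 2.0 for k in range(n)]       # start orthogonal to constants
    for _ in range(iters):
        # y = (shift * I - L_p) x, using (L_p x)_i = deg(i) x_i - sum of neighbour x_j
        y = [(shift - len(adj[v])) * x[idx[v]] + sum(x[idx[u]] for u in adj[v])
             for v in verts]
        mean = sum(y) / n
        y = [t - mean for t in y]                   # remove the constant component
        norm = sum(t * t for t in y) ** 0.5
        x = [t / norm for t in y]
    return [v for _, v in sorted(zip(x, verts))]


# Assumed example: the path 3 - 0 - 4 - 1 - 5 - 2 with scrambled labels.
path = {3: {0}, 0: {3, 4}, 4: {0, 1}, 1: {4, 5}, 5: {1, 2}, 2: {5}}
order = spectral_order(path)
print(order)
```

For a path graph the entries of the Fiedler vector are monotonic along the path, so sorting them recovers the path (in one direction or the other) and the permuted matrix is tridiagonal.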

The use of the Fiedler vector for reordering *A* comes from considering the matrix envelope. The size of the envelope can be written as

$$|env(A)| = \sum_{i=1}^{n} \beta_i = \sum_{i=1}^{n} \max_{\substack{k \le i \\ a_{ik} \ne 0}} (i - k).$$

Observation 8.1 implies that the asymptotic upper bound on the operation count for the factorization based on *env(A)* is

$$work_{env} = \sum_{i=1}^{n} \beta_i^2 = \sum_{i=1}^{n} \max_{\substack{k \le i \\ a_{ik} \ne 0}} (i - k)^2.$$

Ordering the vertices using the Fiedler vector is closely related to minimizing *weightenv* over all possible vertex reorderings, where 

$$weight_{env} = \sum_{i=1}^{n} \sum_{\substack{k \le i \\ a_{ik} \ne 0}} (i - k)^2.$$

Thus, while minimizing the profile and envelope is related to the infinity norm, minimizing *weightenv* is related to the Euclidean norm of the distance between graph vertices.

Although computing the Fiedler vector can be computationally expensive, it is easy to vectorize and parallelize, and the resulting ordering can give small profiles and low operation counts.

#### **8.3 Local Fill-Reducing Orderings for Nonsymmetric S{*A*}**

If S{*A*} is nonsymmetric, then an often-used strategy is to apply the minimum degree algorithm (or one of its variants) or a band or profile-reducing ordering to the undirected graph G(*A* + *A<sup>T</sup>*). This can work well if the symmetry index *s*(*A*) is close to 1. But if *A* is highly nonsymmetric (typically, for values of *s*(*A*) less than 0.5, *A* is considered to be highly nonsymmetric), then a different approach is required. **Markowitz pivoting** generalizes the MD algorithm by choosing the pivot entry based on vertex degrees computed directly from the nonsymmetric S{*A*}; the result is a nonsymmetric permutation. It can be described using a sequence of bipartite graphs of the active submatrices but here we use a matrix-based description that permutes *A* on-the-fly. Note that Markowitz pivoting is generally incorporated into the numerical factorization phase of an LU solver, rather than being used to derive an initial reordering of *A*.

At step *k* of the LU factorization, consider the (*n* − *k* + 1) × (*n* − *k* + 1) active submatrix, that is, the Schur complement *S*<sup>(*k*)</sup> given by (3.2). Let *nz*(*row<sub>i</sub>*) and *nz*(*col<sub>j</sub>*) denote the number of entries in row *i* and column *j* of *S*<sup>(*k*)</sup> (1 ≤ *i, j* ≤ *n* − *k* + 1). Markowitz pivoting selects as the *k*-th pivot the entry of *S*<sup>(*k*)</sup> that minimizes the **Markowitz count** given by the product

$$(nz(row_i) - 1)(nz(col_j) - 1).$$

This strategy is summarized in Algorithm 8.5 and illustrated in Figure 8.10. Here the first pivot is *a*<sub>24</sub> with Markowitz count 1; it does not cause fill-in. The second pivot has Markowitz count 2 in *S*<sup>(2)</sup>; it results in one filled entry. Note that the interchanges of rows and columns that are potentially performed at each of the first *n* − 1 steps of the factorization give the row and column permutation matrices on the output of Algorithm 8.5. Implementation of the algorithm requires access to the rows and the columns of the matrix.
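A purely symbolic version of Markowitz pivoting, working on the sparsity pattern alone, can be sketched as follows. The function name, the set-of-pairs pattern format, the tie-breaking by position and the 4 × 4 arrow-shaped example are assumptions for illustration; numerical values, and hence the stability safeguards that a practical code needs, are ignored, and the pattern is assumed to be structurally nonsingular with no cancellation.

```python
def markowitz_pivots(pattern):
    """Return (pivot positions in original indices, number of filled entries)."""
    entries = set(pattern)
    pivots, fill = [], 0
    while entries:
        # Row and column counts of the active submatrix.
        row_nz, col_nz = {}, {}
        for i, j in entries:
            row_nz[i] = row_nz.get(i, 0) + 1
            col_nz[j] = col_nz.get(j, 0) + 1
        # Entry of minimum Markowitz count; ties broken by position.
        i, j = min(entries,
                   key=lambda e: ((row_nz[e[0]] - 1) * (col_nz[e[1]] - 1), e))
        pivots.append((i, j))
        rows = [r for (r, c) in entries if c == j and r != i]
        cols = [c for (r, c) in entries if r == i and c != j]
        # Remove the pivot row and column, then add the Schur complement fill.
        entries = {(r, c) for (r, c) in entries if r != i and c != j}
        for r in rows:
            for c in cols:
                if (r, c) not in entries:
                    entries.add((r, c))
                    fill += 1
    return pivots, fill


# Assumed example: dense first row and column plus the diagonal (1-based indices).
arrow = {(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (3, 1), (4, 1),
         (2, 2), (3, 3), (4, 4)}
pivots, fill = markowitz_pivots(arrow)
print(pivots, fill)
```

Choosing the (1, 1) entry first would fill the whole trailing 3 × 3 block, whereas the Markowitz choice defers the dense row and column and produces no fill for this pattern.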

#### **ALGORITHM 8.5 Markowitz pivoting**

**Input:** Matrix *A* with a nonsymmetric sparsity pattern.

**Output:** *A* = *PAQ*, where *P* and *Q* are permutation matrices chosen to limit fill-in.



**Figure 8.10** Illustration of Markowitz pivoting. The first and second pivots are circled. The sparsity pattern of *A* = *S*<sup>(1)</sup> is on the left. In the centre is the sparsity pattern after permuting the pivot in position (2, 4) to the (1, 1) position of *S*<sup>(1)</sup>. There is no fill-in after the first factorization step. On the right is the sparsity pattern after selecting the second pivot that has the original position (4, 2) and permuting it to the (1, 1) position of *S*<sup>(2)</sup>. The resulting filled entry is denoted by *f*. Note that the nonsymmetric permutations transform the originally irreducible matrix into a reducible one.

Markowitz pivoting as described here only considers the sparsity of *A* and the subsequent Schur complements. In practice, the pivoting strategy also needs to avoid small pivots because, as discussed in the last chapter, they can lead to numerical instability. A simple improvement is to break ties in Step 4 by choosing from the entries with the minimum Markowitz count the one of largest absolute value.

Because computing row and column counts is expensive, practical implementations may restrict computing them to a limited number of rows and columns. Alternatively, the search may be restricted to a predetermined number of rows of lowest row count (typically two or three rows), choosing entries with best Markowitz count and breaking ties on numerical grounds. Another option is to restrict the pivot choice to diagonal entries, in which case *A* is permuted symmetrically.

Algorithm 8.5 needs storage formats that can accommodate dynamic changes to the Schur complements. An example is the DS format described in Section 1.3.2, which allows access to both the rows and the columns. However, this format is only feasible if the amount of fill-in during the factorization is not large.

#### **8.4 Global Nested Dissection Orderings**

Nested dissection is the most important and widely used global ordering strategy for direct methods when S{*A*} is symmetric; it is particularly effective for ordering very large matrices. It proceeds by identifying a small set of vertices V<sub>S</sub> (known as a **vertex separator**) that, if removed, separates the graph into two disjoint subgraphs described by the vertex subsets B and W (commonly called "black" and "white", respectively). The rows and columns belonging to B are labelled first, then those belonging to W and finally those in V<sub>S</sub>. The reordered matrix has the form

$$
\begin{pmatrix}
A\_{\mathcal{B},\mathcal{B}} & 0 & A\_{\mathcal{B},\mathcal{V}\_{\mathcal{S}}} \\ 0 & A\_{\mathcal{W},\mathcal{W}} & A\_{\mathcal{W},\mathcal{V}\_{\mathcal{S}}} \\ A\_{\mathcal{B},\mathcal{V}\_{\mathcal{S}}}^{T} & A\_{\mathcal{W},\mathcal{V}\_{\mathcal{S}}}^{T} & A\_{\mathcal{V}\_{\mathcal{S}},\mathcal{V}\_{\mathcal{S}}} \\ \end{pmatrix}.
\tag{8.2}
$$

**Figure 8.11** A simple example to illustrate nested dissection. The pattern of the original matrix (top), the partitioned graph (centre), and the corresponding symmetrically permuted matrix (bottom) are given.

This is shown for a 13 × 13 example in Figure 8.11. Provided the variables are eliminated in the permuted order, no fill occurs within the zero off-diagonal blocks. If |V<sub>S</sub>| is small and |B| and |W| are similar, these zero blocks account for approximately half the possible entries in the matrix. The reordering can be applied recursively to the submatrices *A*<sub>B,B</sub> and *A*<sub>W,W</sub> until the vertex subsets

#### **ALGORITHM 8.6 Nested dissection algorithm**

**Input:** Graph G of a symmetrically structured matrix *A* and a partitioning algorithm **PartitionAlg**.

**Output:** A permutation vector *p* that defines a new labelling of the vertices of G.

```
1: recursive function (p = nested_dissection(A, PartitionAlg))
2:     if dissection has terminated then    ▷ Vertex subsets are smaller than some threshold
3:         p = AMD(V, E)                    ▷ Compute an AMD ordering
4:     else
5:         Use PartitionAlg(V, E) to obtain the vertex partitioning (B, W, VS)
6:         pB = nested_dissection(AB,B, PartitionAlg)
7:         pW = nested_dissection(AW,W, PartitionAlg)
8:         pVS is an ordering of VS
9:         Set p = [pB; pW; pVS]            ▷ pB, pW and pVS stacked in this order
10:    end if
11: end recursive function
```
are of size less than some prescribed threshold. At this stage, a local ordering technique (such as AMD) is normally more effective than nested dissection, and so a switch is made. The general form of the nested dissection algorithm is summarized in Algorithm 8.6. The parameter **PartitionAlg** specifies the algorithm used in determining the partitioning of the vertices. The performance and efficacy of nested dissection are highly dependent on the choice of **PartitionAlg**. Originally, level set based methods were used but most current approaches use multilevel techniques that create a hierarchy of graphs, each representing the original graph, but with a smaller dimension. The smallest (that is, the coarsest) graph in the sequence is partitioned. This partition is propagated back through the sequence of graphs, while being periodically refined.
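The recursion of Algorithm 8.6 can be sketched in a few lines. This is an illustration under stated assumptions rather than a practical ordering code: `level_set_partition` is a hypothetical stand-in for **PartitionAlg** (a breadth-first level structure with the middle level taken as separator, in the spirit of the early level set methods), and `sorted` is used as a placeholder where a real code would call AMD.

```python
from collections import deque


def level_set_partition(adj, vertices):
    """Hypothetical PartitionAlg: returns (B, W, S) with S a vertex separator."""
    start = min(vertices)
    level = {start: 0}
    queue = deque([start])
    while queue:                          # breadth-first search restricted to `vertices`
        u = queue.popleft()
        for v in adj[u]:
            if v in vertices and v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    unreached = vertices - set(level)
    if unreached:                         # already disconnected: empty separator
        return set(level), unreached, set()
    mid = max(level.values()) // 2        # middle level separates earlier from later levels
    B = {v for v, l in level.items() if l < mid}
    W = {v for v, l in level.items() if l > mid}
    S = {v for v, l in level.items() if l == mid}
    return B, W, S


def nested_dissection(adj, vertices=None, threshold=3):
    """Return a list p: the vertices in their new elimination order."""
    vertices = set(adj) if vertices is None else set(vertices)
    if len(vertices) <= threshold:
        return sorted(vertices)           # placeholder for a local ordering such as AMD
    B, W, S = level_set_partition(adj, vertices)
    return (nested_dissection(adj, B, threshold)
            + nested_dissection(adj, W, threshold)
            + sorted(S))                  # separator vertices are ordered last


# Path graph 0 - 1 - ... - 6 as a small example.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 7] for i in range(7)}
p = nested_dissection(adj, threshold=2)
print(p)
```

For the path graph the top-level separator is the middle vertex, which therefore appears last in `p`; each half is then dissected in the same way.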

#### **8.5 Bordered Forms**

Another possibility to exploit the global matrix structure is to use bordered block forms. These forms can arise naturally in some practical applications.

#### *8.5.1 Doubly Bordered Form*

The matrix (8.2) is an example of a **doubly bordered block diagonal (DBBD)** form. More generally, a matrix is said to be in DBBD form if it has the block structure

$$A\_{DB} = \begin{pmatrix} A\_{1,1} & & & & C\_1 \\ & A\_{2,2} & & & C\_2 \\ & & \ddots & & \vdots \\ & & & A\_{Nb,Nb} & C\_{Nb} \\ R\_1 & R\_2 & \dots & R\_{Nb} & B \end{pmatrix},\tag{8.3}$$

where *Nb* > 1, the blocks *A<sub>lb,lb</sub>* on the diagonal are **square** *n<sub>lb</sub>* × *n<sub>lb</sub>* matrices and the border blocks *C<sub>lb</sub>* and *R<sub>lb</sub>* are *n<sub>lb</sub>* × *n<sub>S</sub>* and *n<sub>S</sub>* × *n<sub>lb</sub>* matrices, respectively, with *n<sub>S</sub>* ≪ *n<sub>lb</sub>* (1 ≤ *lb* ≤ *Nb*). *B* is an *n<sub>S</sub>* × *n<sub>S</sub>* matrix. The blocks can have very different sizes. A nested dissection ordering can be used to permute a symmetrically structured matrix *A* to a symmetrically structured DBBD form (S{*R<sub>i</sub>*} = S{*C<sub>i</sub><sup>T</sup>*}). If S{*A*} is close to symmetric, then nested dissection can be applied to S{*A* + *A<sup>T</sup>*}. In finite-element applications, the DBBD form corresponds to partitioning the underlying finite-element domain into non-overlapping subdomains; each *A<sub>lb,lb</sub>* represents the interior of a subdomain and the variables in the borders are those that lie on an interface between two or more subdomains.

Coarse-grained parallel approaches aim to factorize the *A<sub>lb,lb</sub>* blocks in parallel before solving the interface problem that connects the blocks. The block factorization of *A<sub>DB</sub>* is

$$A\_{DB} = \begin{pmatrix} L\_1 & & & & \\ & L\_2 & & & \\ & & \ddots & & \\ & & & L\_{Nb} & \\ \widehat{R}\_1 & \widehat{R}\_2 & \dots & \widehat{R}\_{Nb} & L\_S \end{pmatrix} \begin{pmatrix} U\_1 & & & & \widehat{C}\_1 \\ & U\_2 & & & \widehat{C}\_2 \\ & & \ddots & & \vdots \\ & & & U\_{Nb} & \widehat{C}\_{Nb} \\ & & & & U\_S \end{pmatrix},$$

where
$$
\widehat{R}\_{lb} = R\_{lb}U\_{lb}^{-1}, \quad \widehat{C}\_{lb} = L\_{lb}^{-1}C\_{lb} \ (1 \le lb \le Nb), \quad L\_{S}U\_{S} = B - \sum\_{lb=1}^{Nb} \widehat{R}\_{lb}\widehat{C}\_{lb}.
$$

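The process is summarized in Algorithm 8.7. Here, for simplicity of notation, the permutation matrices for the block factorizations are set to the identity; in practice, *A<sub>lb,lb</sub>* = *P<sub>lb</sub>L<sub>lb</sub>U<sub>lb</sub>Q<sub>lb</sub>* for some permutation matrices *P<sub>lb</sub>* and *Q<sub>lb</sub>* (1 ≤ *lb* ≤ *Nb*), and the interface block *S* (that is, *B* minus the sum of the products of the updated border blocks) is factorized as *S* = *P<sub>S</sub>L<sub>S</sub>U<sub>S</sub>Q<sub>S</sub>* for some permutation matrices *P<sub>S</sub>* and *Q<sub>S</sub>*.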
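The block factorization can be checked numerically on a small dense example. The sketch below is illustrative only and rests on several assumptions: it uses dense NumPy arrays instead of sparse storage, the helper names `lu_nopivot` and `dbbd_factorize` are hypothetical, all permutations are the identity (as in the simplified notation above), and the test blocks are made diagonally dominant so that no pivoting is required.

```python
import numpy as np


def lu_nopivot(A):
    """Dense LU without pivoting (assumes nonzero pivots): A = L U, L unit lower."""
    n = A.shape[0]
    L, U = np.eye(n), np.array(A, dtype=float)
    for k in range(n - 1):
        L[k + 1:, k] = U[k + 1:, k] / U[k, k]
        U[k + 1:, :] -= np.outer(L[k + 1:, k], U[k, :])
    return L, np.triu(U)


def dbbd_factorize(diag_blocks, C_blocks, R_blocks, B):
    """Factorize each diagonal block, update the borders, then factorize S."""
    S = np.array(B, dtype=float)
    factors = []
    for A_lb, C_lb, R_lb in zip(diag_blocks, C_blocks, R_blocks):  # independent tasks
        L_lb, U_lb = lu_nopivot(A_lb)
        C_hat = np.linalg.solve(L_lb, C_lb)            # C_hat = L^{-1} C
        R_hat = np.linalg.solve(U_lb.T, R_lb.T).T      # R_hat = R U^{-1}
        S -= R_hat @ C_hat                             # assemble the interface block
        factors.append((L_lb, U_lb, R_hat, C_hat))
    L_S, U_S = lu_nopivot(S)
    return factors, L_S, U_S


rng = np.random.default_rng(0)
sizes, n_S = [3, 2], 2
diag_blocks = [rng.random((s, s)) + 5.0 * np.eye(s) for s in sizes]
C_blocks = [rng.random((s, n_S)) for s in sizes]
R_blocks = [rng.random((n_S, s)) for s in sizes]
B = rng.random((n_S, n_S)) + 5.0 * np.eye(n_S)
factors, L_S, U_S = dbbd_factorize(diag_blocks, C_blocks, R_blocks, B)

# Assemble A_DB and the block factors to verify A_DB = L U.
n_int = sum(sizes)
n = n_int + n_S
A_DB, L_full, U_full = (np.zeros((n, n)) for _ in range(3))
off = 0
for A_lb, C_lb, R_lb, (L_lb, U_lb, R_hat, C_hat) in zip(
        diag_blocks, C_blocks, R_blocks, factors):
    s = slice(off, off + A_lb.shape[0])
    A_DB[s, s], A_DB[s, n_int:], A_DB[n_int:, s] = A_lb, C_lb, R_lb
    L_full[s, s], L_full[n_int:, s] = L_lb, R_hat
    U_full[s, s], U_full[s, n_int:] = U_lb, C_hat
    off += A_lb.shape[0]
A_DB[n_int:, n_int:], L_full[n_int:, n_int:], U_full[n_int:, n_int:] = B, L_S, U_S
print(np.allclose(L_full @ U_full, A_DB))
```

The loop over the diagonal blocks contains no dependencies between iterations apart from the accumulation into `S`, which is where the synchronization discussed in the text would be needed in a parallel code.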

There are several opportunities to incorporate parallelism. First, the factorizations of the blocks *A<sub>lb,lb</sub>* on the diagonal are completely independent. In addition,

#### **ALGORITHM 8.7 Coarse-grained parallel LU factorization using DBBD form**

**Input:** Matrix *A<sub>DB</sub>* in DBBD form (8.3).

**Output:** Block LU factorization.


the factorization of each individual *A<sub>lb,lb</sub>* can be parallelized. The same is true for the triangular solves that update the border blocks. Second, the assembly of the interface block *S* can be partially parallelized (it can be started as soon as the first updated border blocks are available). Third, the LU factorization of *S* can be parallelized.

Observe that *S* is generally significantly denser than the other blocks and can present a computational bottleneck. In fact, not only is factorizing *S* expensive in terms of the memory and operations required, but assembling the updates to it can also be time consuming. This is because multiple submatrices may contribute to the same entry of *S*, and these updates cannot be performed at the same time. Furthermore, for an efficient parallel implementation, load balance must be considered. If the work required for factorizing each of the blocks on the diagonal is not similar, then the time will be dominated by the most expensive block. One possible solution is to choose *Nb* to be greater than the number of processors and use dynamic scheduling to achieve good load balance. Unfortunately, if the number of blocks increases, so too does the size of *S*.

If *A* is not SPD, then factorizing the *A<sub>lb,lb</sub>* blocks without considering the entries in the border can potentially lead to stability problems. Consider the first step in factorizing *A<sub>lb,lb</sub>* and the threshold pivoting test (7.5) for a sparse LU factorization. The pivot candidate (*A<sub>lb,lb</sub>*)<sub>11</sub> must satisfy

$$\max\left\{ \max\_{i>1} |(A\_{lb,lb})\_{i1}|, \ \max\_{k} |(R\_{lb})\_{k1}| \right\} \le \gamma^{-1} |(A\_{lb,lb})\_{11}|,$$

where *γ* ∈ (0, 1] is the threshold parameter. Large entries in the row border matrix *R<sub>lb</sub>* can prevent pivots being selected within *A<sub>lb,lb</sub>*. Stability can be maintained by moving rows and columns that cannot be eliminated to the borders. This increases the border size and may adversely affect the a priori sparse data structures for holding the factors, increase the work required to perform the factorization, and reduce the potential for parallelism within the factorization of the block.

#### *8.5.2 Singly Bordered Form*

An alternative strategy is to permute *A* to **singly bordered block diagonal (SBBD)** form

$$A\_{SB} = \begin{pmatrix} A\_{1,1} & & & & C\_1 \\ & A\_{2,2} & & & C\_2 \\ & & \ddots & & \vdots \\ & & & A\_{Nb,Nb} & C\_{Nb} \end{pmatrix},$$

where the blocks *A<sub>lb,lb</sub>* are **rectangular** *m<sub>lb</sub>* × *n<sub>lb</sub>* matrices with *m<sub>lb</sub>* ≥ *n<sub>lb</sub>* and Σ<sub>*lb*=1</sub><sup>*Nb*</sup> *m<sub>lb</sub>* = *n*, and the border blocks *C<sub>lb</sub>* are of order *m<sub>lb</sub>* × *n<sub>I</sub>* (*n<sub>I</sub>* ≪ *n<sub>lb</sub>*), where *n<sub>I</sub>* = Σ<sub>*lb*=1</sub><sup>*Nb*</sup> (*m<sub>lb</sub>* − *n<sub>lb</sub>*). The linear system becomes

$$
\begin{pmatrix} A\_{1,1} & & & & C\_1 \\ & A\_{2,2} & & & C\_2 \\ & & \ddots & & \vdots \\ & & & A\_{Nb,Nb} & C\_{Nb} \end{pmatrix} \begin{pmatrix} x\_1 \\ x\_2 \\ \vdots \\ x\_{Nb} \\ x\_I \end{pmatrix} = \begin{pmatrix} b\_1 \\ b\_2 \\ \vdots \\ b\_{Nb} \end{pmatrix}, \tag{8.4}
$$

where *x<sub>lb</sub>* is of length *n<sub>lb</sub>*, *x<sub>I</sub>* is a vector of length *n<sub>I</sub>* of interface variables, and the right-hand side vectors *b<sub>lb</sub>* are of length *m<sub>lb</sub>*, such that

$$
\begin{pmatrix} A\_{lb,lb} & C\_{lb} \end{pmatrix} \begin{pmatrix} x\_{lb} \\ x\_I \end{pmatrix} = b\_{lb}, \quad 1 \le lb \le Nb.
$$

A partial factorization of each block matrix is performed, that is,

$$\begin{pmatrix} A\_{lb,lb} & C\_{lb} \end{pmatrix} = P\_{lb} \begin{pmatrix} L\_{lb} & \\ \bar{L}\_{lb} & I \end{pmatrix} \begin{pmatrix} U\_{lb} & \bar{U}\_{lb} \\ & S\_{lb} \end{pmatrix} Q\_{lb}, \tag{8.5}$$

where *P<sub>lb</sub>* and *Q<sub>lb</sub>* are permutation matrices, *L<sub>lb</sub>* and *U<sub>lb</sub>* are *n<sub>lb</sub>* × *n<sub>lb</sub>* lower and upper triangular matrices, respectively, and if *q<sub>lb</sub>* is the number of columns in *C<sub>lb</sub>* with at least one entry, *S<sub>lb</sub>* is a (*m<sub>lb</sub>* − *n<sub>lb</sub>*) × *q<sub>lb</sub>* local Schur complement matrix. Pivots can only be chosen from the columns of *A<sub>lb,lb</sub>* because the columns of *C<sub>lb</sub>* have entries in at least one other border block *C<sub>jb</sub>* (*jb* ≠ *lb*). The pivot candidate (*A<sub>lb,lb</sub>*)<sub>11</sub> at the first elimination step must satisfy

$$\max\_{i>1} |(A\_{lb,lb})\_{i1}| \le \gamma^{-1} |(A\_{lb,lb})\_{11}|,$$

and provided *A* is nonsingular, there will always be a numerically satisfactory pivot in column 1 of *A<sub>lb,lb</sub>*. The same is true at each elimination step so that *n<sub>lb</sub>* pivots can be chosen. An *n<sub>I</sub>* × *n<sub>I</sub>* matrix *S* is obtained by assembling the *Nb* local

#### **ALGORITHM 8.8 Coarse-grained parallel LU factorization and solve using SBBD form**

**Input:** Linear system in SBBD form (8.4).

**Output:** Block LU factorization and computed solution *x*.

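1: *S* = 0 and *z<sub>I</sub>* = 0

2: **for** *lb* = 1 : *Nb* **do**

3: Perform a partial *LU* factorization (8.5) of (*A<sub>lb,lb</sub>*, *C<sub>lb</sub>*).

4: Solve $P\_{lb} \begin{pmatrix} L\_{lb} & \\ \bar{L}\_{lb} & I \end{pmatrix} \begin{pmatrix} y\_{lb} \\ \bar{y}\_{lb} \end{pmatrix} = b\_{lb}$

5: *S* = *S* + *S<sub>lb</sub>* and *z<sub>I</sub>* = *z<sub>I</sub>* + *ȳ<sub>lb</sub>* ▷ Assemble *S* and *z<sub>I</sub>*

6: **end for**

7: *S* = *P<sub>S</sub>L<sub>S</sub>U<sub>S</sub>Q<sub>S</sub>* ▷ *P<sub>S</sub>* and *Q<sub>S</sub>* are permutation matrices

8: Solve *P<sub>S</sub>L<sub>S</sub>y<sub>I</sub>* = *z<sub>I</sub>* and then *U<sub>S</sub>Q<sub>S</sub>x<sub>I</sub>* = *y<sub>I</sub>* ▷ Forward then back substitution

9: **for** *lb* = 1 : *Nb* **do**

10: Solve *U<sub>lb</sub>Q<sub>lb</sub>x<sub>lb</sub>* = *y<sub>lb</sub>* − *Ū<sub>lb</sub>Q<sub>lb</sub>x<sub>I</sub>*

11: **end for**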

Schur complement matrices *S<sub>lb</sub>*. The approach is summarized as Algorithm 8.8. The operations on the submatrices can be performed in parallel.
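The flow of Algorithm 8.8 can be followed on a tiny dense system. The sketch below makes several simplifying assumptions that are not part of the algorithm as stated: dense NumPy arrays are used, only rows are permuted during the partial factorization (the column permutations *Q<sub>lb</sub>* are taken to be the identity), the forward solve is carried out by eliminating on the augmented block, each *C<sub>lb</sub>* is stored with all *n<sub>I</sub>* columns, and assembly of *S* and *z<sub>I</sub>* is realized by stacking the local contributions because each block owns distinct rows. The names `partial_lu` and `sbbd_solve` and the test data are hypothetical.

```python
import numpy as np


def partial_lu(A, C, b):
    """Eliminate the n_lb columns of A with row pivoting, carrying C and b along.

    Returns U, U_bar, y (first n_lb rows) and S, y_bar (remaining rows).
    """
    n = A.shape[1]
    W = np.hstack([A, C, b.reshape(-1, 1)]).astype(float)
    for k in range(n):
        p = k + int(np.argmax(np.abs(W[k:, k])))   # pivot chosen from column k of A only
        W[[k, p]] = W[[p, k]]
        W[k + 1:, k:] -= np.outer(W[k + 1:, k] / W[k, k], W[k, k:])
    return W[:n, :n], W[:n, n:-1], W[:n, -1], W[n:, n:-1], W[n:, -1]


def sbbd_solve(blocks):
    """blocks: list of (A_lb, C_lb, b_lb). Returns ([x_1, ..., x_Nb], x_I)."""
    parts = [partial_lu(A, C, b) for A, C, b in blocks]     # independent per block
    S = np.vstack([part[3] for part in parts])              # assemble interface matrix
    z_I = np.concatenate([part[4] for part in parts])
    x_I = np.linalg.solve(S, z_I)                           # interface problem
    xs = [np.linalg.solve(U, y - U_bar @ x_I) for U, U_bar, y, _, _ in parts]
    return xs, x_I


# Two hypothetical 3 x 2 blocks, so n_I = 2 * (3 - 2) = 2 interface variables.
A1 = np.array([[4.0, 1.0], [1.0, 3.0], [1.0, 1.0]])
C1 = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]])
A2 = np.array([[3.0, 1.0], [1.0, 4.0], [2.0, 1.0]])
C2 = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 3.0]])
x1_true, x2_true, xI_true = np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])
b1 = A1 @ x1_true + C1 @ xI_true
b2 = A2 @ x2_true + C2 @ xI_true

xs, x_I = sbbd_solve([(A1, C1, b1), (A2, C2, b2)])
print(xs, x_I)
```

Only the solve with `S` couples the blocks; the list comprehensions over `blocks` and `parts` correspond to the two parallel loops of the algorithm.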

#### *8.5.3 Ordering to Singly Bordered Form*

The objective is to permute *A* to an SBBD form with a narrow column border. One way to do this is to choose the number *Nb* > 1 of required blocks and use nested dissection to compute a vertex separator V<sub>S</sub> of G(*A* + *A<sup>T</sup>*) such that removing V<sub>S</sub> and its incident edges splits G(*A* + *A<sup>T</sup>*) into *Nb* components. Then initialize the set S<sub>C</sub> of border columns to V<sub>S</sub> and let V<sub>1b</sub>, V<sub>2b</sub>, ..., V<sub>Nb</sub> be the subsets of column indices of *A* that correspond to the *Nb* components and let *n<sub>i,kb</sub>* be the number of column indices in row *i* that belong to V<sub>kb</sub>. If *lb* = arg max<sub>1≤*kb*≤*Nb*</sub> |*n<sub>i,kb</sub>*|, then row *i* is assigned to partition *lb*. All column indices in row *i* that do not belong to V<sub>lb</sub> are moved into S<sub>C</sub>. Once all the rows have been considered, the only rows that remain unassigned are those that have all their nonzero entries in V<sub>S</sub>. Such rows can be assigned equally to the *Nb* partitions. If *j* ∈ S<sub>C</sub> is such that column *j* of *A* has nonzero entries only in rows belonging to partition *kb*, then *j* can be removed from S<sub>C</sub> and added to V<sub>kb</sub>. The procedure is outlined as Algorithm 8.9. The computed vector *block* and set S<sub>C</sub> can be used to define permutation matrices *P* and *Q* such that *PAQ* = *A<sub>SB</sub>*. In practice, it may be necessary to modify the algorithm to ensure a good row balance between the number of rows in the blocks; this may lead

#### **ALGORITHM 8.9 SBBD ordering of a general matrix**

**Input:** Matrix *A*, the number *Nb* > 1 of blocks and corresponding vertex separator V<sub>S</sub> of G(*A* + *A<sup>T</sup>*).

**Output:** Vector *block* such that *block*(*i*) denotes the partition in the SBBD form to which row *i* is assigned (1 ≤ *i* ≤ *n*) and S<sub>C</sub> is the set of border columns.


```
4: Add up the number ni,kb of column indices belonging to Vkb, 1 ≤ kb ≤ Nb
```

9: Remove *j* from V<sub>kb</sub> and add to S<sub>C</sub>


to a larger S<sub>C</sub>. It is also necessary to avoid adding duplicate column indices into S<sub>C</sub> (alternatively, a final step can be added that removes duplicates).
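The row assignment described above can be sketched directly from the prose. This is an illustration only and not a transcription of Algorithm 8.9: the function name `sbbd_order`, the data layout (`rows[i]` is the set of column indices in row *i*, `part[j]` is the component of column *j* or `None` if *j* lies in the separator) and the 6 × 6 example are assumptions, ties in the arg max are broken by taking the lowest block index, unassigned rows are distributed round-robin, and no attempt is made to balance the row counts. Holding the border as a Python set avoids duplicate column indices automatically.

```python
def sbbd_order(rows, part, n_blocks):
    """Return (block, border, part) for an SBBD ordering.

    rows[i]:  set of column indices of the entries in row i.
    part[j]:  index kb of the component containing column j, or None if j is in V_S.
    """
    part = dict(part)                                   # work on a copy
    border = {j for j, kb in part.items() if kb is None}   # S_C initialised to V_S
    block = [None] * len(rows)

    for i, cols in enumerate(rows):
        counts = [0] * n_blocks                         # counts[kb] plays the role of n_{i,kb}
        for j in cols:
            if part[j] is not None:
                counts[part[j]] += 1
        if max(counts) == 0:
            continue                                    # all entries in the border: assign later
        lb = counts.index(max(counts))
        block[i] = lb
        for j in cols:
            if part[j] is not None and part[j] != lb:
                part[j] = None                          # remove j from V_kb ...
                border.add(j)                           # ... and add it to S_C

    leftovers = [i for i, kb in enumerate(block) if kb is None]
    for t, i in enumerate(leftovers):                   # spread unassigned rows evenly
        block[i] = t % n_blocks

    for j in sorted(border):                            # return columns used by one block only
        owners = {block[i] for i, cols in enumerate(rows) if j in cols}
        if len(owners) == 1:
            part[j] = owners.pop()
            border.discard(j)
    return block, border, part


# Hypothetical 6 x 6 pattern; columns 2 and 5 form the separator.
rows = [{0, 1}, {0, 1, 2}, {2, 5}, {3, 4}, {2, 3, 4}, {0, 3, 4}]
part = {0: 0, 1: 0, 2: None, 3: 1, 4: 1, 5: None}
block, border, new_part = sbbd_order(rows, part, n_blocks=2)
print(block, sorted(border))
```

In this example column 0 is pushed into the border by row 5, while column 5 is released from the border because only rows of one partition use it.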

The matching-based orderings discussed in Section 6.3 that permute off-diagonal entries onto the diagonal can increase the symmetry index of the resulting reordered matrix, particularly in cases where *A* is very sparse with a large number of zeros on the diagonal. Frequently, applying a matching ordering before ordering to SBBD form reduces the number of columns in SC.

#### **8.6 Notes and References**

The most influential early paper on orderings for sparse symmetric matrices is that of Tinney & Walker (1967). It first proposed the minimum degree algorithm (referred to as scheme 2) and the minimum fill-in algorithm (referred to as scheme 3). The fast implementation of the minimum degree algorithm using quotient graphs is summarized by George & Liu (1980a). Further developments were made throughout the 1980s, including the multiple minimum degree variant, mass elimination and external degree; key references are Liu (1985) and George & Liu (1989). An important development in the 1990s was the approximate minimum degree algorithm of Amestoy et al. (1996). Modifying the AMD algorithm for matrices with some dense rows is discussed in Dollar & Scott (2010). For a careful description of different variants of the minimum degree strategy and their complexity we recommend Heggernes et al. (2001). Rothberg & Eisenstat (1998) consider both minimum degree and minimum fill strategies and Erisman et al. (1987) provide an early evaluation of different strategies for nonsymmetric matrices.

Jennings (1966) presents the first envelope method for sparse Cholesky factorizations. The Cuthill-McKee algorithm comes from the paper by Cuthill & McKee (1969). The GPS algorithm was originally introduced in Gibbs et al. (1976). The book by George & Liu (1981) gives a detailed description of the algorithm while Meurant (1999) includes an enlightening discussion of the relation between the CM and RCM algorithms. A quick search of the literature shows that a large number of bandwidth and profile reduction algorithms have been (and continue to be) reported. Many have their origins in the Cuthill-McKee and GPS algorithms. A widely used two-stage variant that employs level sets is the so-called Sloan algorithm (Sloan, 1986); see also Reid & Scott (1999) for details of an efficient implementation. The use of the Fiedler vector to obtain spectral orderings is introduced in Barnard et al. (1995), with analysis given in George & Pothen (1997). A hybrid algorithm that combines the spectral method with the second stage of Sloan's algorithm to further reduce the profile is proposed in Kumfert & Pothen (1997) and a multilevel variant is given by Hu & Scott (2001). de Oliveira et al. (2018) provide a recent comparison of many bandwidth and profile reduction algorithms.

Reducing the bandwidth when *A* is nonsymmetric is discussed by Reid & Scott (2006). For highly nonsymmetric *A*, Scott (1999) applies a modified Sloan algorithm to the row graph (that is, G(*AA<sup>T</sup>*)) to derive an effective ordering of the rows of *A* for use with a frontal solver. The approach originally proposed by Markowitz (1957) for finding pivots during an LU factorization is incorporated (in modified form) in a number of serial LU factorization codes, including the early solvers MA28 and Y12M (Duff, 1980 and Zlatev, 1991, respectively) as well as MA48 (Duff & Reid, 1996). The book of Duff et al. (2017) includes detailed discussions. To limit permutations to being symmetric, Amestoy et al. (2007) propose minimizing the Markowitz count among the diagonal entries.

A seminal paper on global orderings is George (1973), but a real revolution in the field followed the theoretical analysis of the application of nested dissection for general symmetrically structured sparse matrices given in Lipton et al. (1979). For subsequent extensions discussing separator sizes we suggest Agrawal et al. (1993), Teng (1997), and Spielman & Teng (2007).

From the early 1990s onwards, there have been numerous contributions to graph partitioning algorithms. Significant developments, including multilevel algorithms, have been driven in part by the design and development of mathematical software, notably the well-established packages METIS (2022) and Scotch (2022); both offer versions for sequential and parallel graph partitioning (see also the papers by Karypis & Kumar, 1998a,b and Chevalier & Pellegrini, 2008). The book by Bichot & Siarry (2013) discusses a number of contributions, including hypergraph partitioning, which is well suited to parallel computational models (see, for example, Uçar & Aykanat, 2007 and references to the use of hypergraphs given in the survey article of Davis et al., 2016; hypergraphs can also be used for profile reduction, see Acer et al., 2019).

Hu et al. (2000) present a serial algorithm for ordering nonsymmetric *A* to SBBD form; an implementation is available as HSL\_MC66 within the HSL mathematical software library. Algorithm 8.9 is from Hu & Scott (2005) (see also Duff & Scott, 2005). Alternatively, hypergraphs can be used for SBBD orderings. The best-known packages are the serial code PaToH of Aykanat et al. (2004) and the parallel code PHG from Zoltan (2022).

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 9 Algebraic Preconditioners and Approximate Factorizations**

*In conjunction with iterative methods, preconditioning is often the vital component in enabling the solution of such (linear) systems when the dimension is large. – Wathen (2015)*

*Preconditioning involves exploiting ideas from sparse direct solvers. Gradually, iterative methods started to approach the quality of direct solvers. In earlier times, iterative methods were often special purpose in nature... Now iterative methods are almost mandatory. – Saad (1996b).*

When a matrix factorization is performed using finite precision arithmetic, the computed factors are not the exact factors. Despite this, the objective of sparse direct methods is normally to compute solutions that are accurate within the precision used. As discussed in Chapter 7, theoretical results can be used to assess both stability and accuracy.

The effort to obtain results that are as accurate as possible can lead to complex coding and unavoidable inefficiencies that can be magnified by modern computer architectures. Furthermore, in some situations, more accuracy than is needed (or is justified by the input data) is sought by a direct method. These issues can potentially be addressed by intentionally relaxing the required accuracy of the computed factors. In Section 7.3.3, we discussed static pivoting that allows pivots to be explicitly perturbed during a matrix factorization to enable them to be selected, thereby reducing the computational costs of the factorization (in terms of time and memory). The penalty is that the factorization may be less stable and a refinement process (such as described in Algorithm 7.3) may be needed to improve the accuracy of the computed solution. However, even with sophisticated theoretical and algorithmic tools, factorizations that use such strategies can still be prohibitively expensive and may not be fully robust. An alternative approach is to compute a simpler, cheaper, and sparser approximate factorization of *A* (or of *A*<sup>−1</sup>) and to use this as a preconditioner in combination with an iterative solver to derive a suitable solution of the linear system. The main obstacle is that the choice of an efficient preconditioner is highly problem dependent: what works well for problems from one application may not help for those of a different origin. Our focus is on algebraic preconditioners that are often successfully used in the solution of linear systems arising from a range of diverse applications.

Algebraic preconditioners do not require knowledge of the provenance of the linear system, and their construction relies solely on the matrix *A* (which may only be available implicitly, that is, the action of *A* on vectors is known, but *A* itself is not supplied). They are general methods that are particularly important when little is known about the underlying problem and they are widely applicable because they are designed with few restrictions. However, if more information is known, it can be more effective to use a specialized preconditioner that is designed for the specific application. This division between approaches to preconditioning essentially amounts to whether we are "given a problem" or "given a matrix": algebraic preconditioning is primarily concerned with the latter.

In the following, we refer to an approximate factorization as an **incomplete factorization** to distinguish it from a **complete factorization** of a direct method.

#### **9.1 Introduction to Iterative Solvers**

The two main classes of iterative methods for solving *Ax* = *b* are **stationary** iterative methods (also sometimes called **relaxation** or **simple** methods) and **Krylov subspace** methods. We briefly introduce each class.

#### *9.1.1 Stationary Iterative Methods*

Stationary iterative methods work by splitting *A* as follows:

$$A = M - N,$$

where the matrix *M* is chosen to be nonsingular and easy to invert. Starting with an initial guess *x*<sup>(0)</sup>, the iterations are then given by

$$\mathbf{x}^{(k+1)} = \mathbf{M}^{-1} N \mathbf{x}^{(k)} + \mathbf{M}^{-1} b, \quad k = 0, 1, \ldots \tag{9.1}$$

This can be rewritten as

$$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + M^{-1}(b - A\mathbf{x}^{(k)}) = \mathbf{x}^{(k)} + M^{-1}r^{(k)}, \quad k = 0, 1, \ldots,\tag{9.2}$$

where the vector *r*<sup>(*k*)</sup> = *b* − *Ax*<sup>(*k*)</sup> is the **residual** on the *k*-th iteration. Observe that by substituting *b* = *r*<sup>(*k*)</sup> + *Ax*<sup>(*k*)</sup> into *x* = *A*<sup>−1</sup>*b*, we obtain

$$\boldsymbol{x} = \boldsymbol{A}^{-1}(\boldsymbol{r}^{(k)} + A\boldsymbol{x}^{(k)}) = \boldsymbol{x}^{(k)} + \boldsymbol{A}^{-1}\boldsymbol{r}^{(k)},$$

and if *M* is used to approximate *A*, we again get the iteration (9.2). From (9.2),

$$r^{(k+1)} = b - A(\mathbf{x}^{(k)} + M^{-1}r^{(k)}) = (I - AM^{-1})r^{(k)} = \dots = (I - AM^{-1})^{k+1}r^{(0)},\tag{9.3}$$

and if *e*<sup>(*k*)</sup> = *x* − *x*<sup>(*k*)</sup> is the error vector on iteration *k*, then

$$e^{(k+1)} = M^{-1} N \, e^{(k)} = \dots = (M^{-1}N)^{k+1} \, e^{(0)} = (I - M^{-1}A)^{k+1} \, e^{(0)}.\tag{9.4}$$

The matrix *I* − *M*<sup>−1</sup>*A* or *I* − *AM*<sup>−1</sup> is called the **iteration matrix**. In general, (9.3) is evaluated rather than (9.4) because *e*<sup>(0)</sup> is unknown and (9.3) computes the residuals that are often used to monitor convergence.

**Theorem 9.1 (Saad 2003b; Greenbaum 1997)** *For any initial x*<sup>(0)</sup> *and vector b, the iteration (9.1) converges if and only if the spectral radius of the iteration matrix* (*I* − *M*<sup>−1</sup>*A*) *is less than unity.*

*Proof* The **spectral radius** of an *n* × *n* matrix *C* with eigenvalues *λ*1*, λ*2*,...,λn* is defined to be

$$\rho(C) = \max\{|\lambda\_i| \mid 1 \le i \le n\}.\tag{9.5}$$

Furthermore, the sequence of matrix powers *C<sup>k</sup>*, *k* = 0, 1, ..., converges to zero if and only if *ρ*(*C*) < 1. It follows from (9.4) that if the spectral radius of (*I* − *M*<sup>−1</sup>*A*) is less than unity, then the iteration (9.1) converges for any *x*<sup>(0)</sup> and *b*. Conversely, because *M*<sup>−1</sup>*N* = *I* − *M*<sup>−1</sup>*A*, the relation

$$\mathbf{x}^{(k+1)} - \mathbf{x}^{(k)} = (I - M^{-1}A)(\mathbf{x}^{(k)} - \mathbf{x}^{(k-1)}) = \dots = (I - M^{-1}A)^k M^{-1} (b - A\mathbf{x}^{(0)})$$

shows that if the iteration converges for any *x*<sup>(0)</sup> and *b*, then (*I* − *M*<sup>−1</sup>*A*)<sup>*k*</sup>*v* converges to zero for any *v*. Consequently, *ρ*(*I* − *M*<sup>−1</sup>*A*) must be less than unity, and the result follows.

It is generally impractical to compute the spectral radius and sufficient conditions that guarantee convergence are used. Because *ρ*(*C*) ≤ ‖*C*‖ for any matrix norm, a sufficient condition is ‖*I* − *M*<sup>−1</sup>*A*‖ < 1. A small spectral radius leads to rapid convergence, and the closer the eigenvalues of *M*<sup>−1</sup>*A* are to unity, the faster the convergence. However, the eigenvalue distribution (not just the spectral radius) is important in evaluating the rate of convergence.
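Theorem 9.1 is easy to observe on a small example. The sketch below implements the update (9.2) with dense NumPy arrays and compares two choices of *M* for the same matrix; the 2 × 2 matrix, the choices *M* = 4*I* and *M* = *I*, and the function names are assumptions made purely for illustration.

```python
import numpy as np


def stationary_iteration(A, M, b, x0, num_iters):
    """Iterate x <- x + M^{-1}(b - A x) as in (9.2); return x and the residual norms."""
    x = np.array(x0, dtype=float)
    history = []
    for _ in range(num_iters):
        r = b - A @ x                      # residual r^(k)
        history.append(np.linalg.norm(r))
        x = x + np.linalg.solve(M, r)      # apply M^{-1} by solving with M
    return x, history


def spectral_radius(C):
    """Largest eigenvalue modulus, as in (9.5)."""
    return max(abs(np.linalg.eigvals(C)))


A = np.array([[4.0, 1.0], [2.0, 3.0]])
b = np.array([1.0, 2.0])
I = np.eye(2)

M_good = 4.0 * I                           # iteration matrix I - A/4 has eigenvalues 0.5, -0.25
M_bad = I                                  # iteration matrix I - A has eigenvalues -1, -4

rho_good = spectral_radius(I - np.linalg.solve(M_good, A))
rho_bad = spectral_radius(I - np.linalg.solve(M_bad, A))
x, history = stationary_iteration(A, M_good, b, np.zeros(2), 60)
print(rho_good, rho_bad, np.allclose(x, np.linalg.solve(A, b)))
```

With *M* = 4*I* the spectral radius is 0.5 and the iteration converges; with *M* = *I* it is 4 and the same loop would diverge.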

Several standard stationary methods are obtained from the splitting

$$A = D\_A + L\_A + U\_A,\tag{9.6}$$

where *D<sub>A</sub>* is a diagonal matrix that represents the diagonal part of *A*, and *L<sub>A</sub>* and *U<sub>A</sub>* are the strictly lower and upper triangular parts of *A*, respectively. If *ω* > 0 is a scalar parameter, classical methods include:
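As a hedged illustration, the standard textbook choices of *M* built from the splitting (9.6) are Jacobi (*M* = *D<sub>A</sub>*), Gauss-Seidel (*M* = *D<sub>A</sub>* + *L<sub>A</sub>*) and successive over-relaxation, SOR (*M* = *ω*<sup>−1</sup>*D<sub>A</sub>* + *L<sub>A</sub>*). The sketch below forms these matrices densely and evaluates the spectral radius of *I* − *M*<sup>−1</sup>*A*; the function name `splitting_matrix` and the 2 × 2 test matrix are assumptions.

```python
import numpy as np


def splitting_matrix(A, method, omega=1.0):
    """Return M for a classical stationary method based on A = D_A + L_A + U_A."""
    D = np.diag(np.diag(A))                # diagonal part D_A
    L = np.tril(A, k=-1)                   # strictly lower triangular part L_A
    if method == "jacobi":
        return D
    if method == "gauss-seidel":
        return D + L
    if method == "sor":
        return D / omega + L
    raise ValueError(f"unknown method: {method}")


def iteration_spectral_radius(A, M):
    """Spectral radius of the iteration matrix I - M^{-1} A."""
    C = np.eye(A.shape[0]) - np.linalg.solve(M, A)
    return max(abs(np.linalg.eigvals(C)))


A = np.array([[4.0, 1.0], [2.0, 3.0]])
rho_jacobi = iteration_spectral_radius(A, splitting_matrix(A, "jacobi"))
rho_gs = iteration_spectral_radius(A, splitting_matrix(A, "gauss-seidel"))
print(rho_jacobi, rho_gs)
```

For this matrix the Gauss-Seidel spectral radius (1/6) is the square of the Jacobi one, so Gauss-Seidel needs roughly half as many iterations; with *ω* = 1 the SOR matrix coincides with the Gauss-Seidel matrix.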


#### *9.1.2 Krylov Subspace Methods*

Non-stationary iterative methods are of the form

$$\boldsymbol{x}^{(k+1)} = \boldsymbol{x}^{(k)} + \omega^{(k)} \boldsymbol{M}^{-1} \boldsymbol{r}^{(k)}, \quad k = 0, 1, \ldots,$$

where the *ω*<sup>(*k*)</sup> are scalars. In this class of methods, Krylov subspace methods are the most effective. Given a vector *y*, the *k*-th Krylov subspace K<sup>(*k*)</sup>(*A*, *y*) generated by *A* from the vector *y* is defined to be

$$
\mathcal{K}^{(k)}(A, \mathbf{y}) = \text{span}(\mathbf{y}, A\mathbf{y}, \dots, A^{k-1}\mathbf{y}).
$$

The idea behind Krylov subspace methods is to generate a sequence of approximate solutions *x*<sup>(*k*)</sup> ∈ *x*<sup>(0)</sup> + K<sup>(*k*)</sup>(*A*, *r*<sup>(0)</sup>) such that the norm of the corresponding residuals *r*<sup>(*k*)</sup> ∈ K<sup>(*k*+1)</sup>(*A*, *r*<sup>(0)</sup>) converges to zero. For symmetric positive definite (SPD) systems, the Krylov subspace method of choice is the conjugate gradient (CG) method. For nonsymmetric systems, there are a number of popular methods, including the generalized minimal residual (GMRES) method and the biconjugate gradient (BiCG) method, but there is no single method of choice. The key feature they have in common is that at each iteration only matrix-vector products with *A* (and possibly with *A<sup>T</sup>* in the nonsymmetric case) are required.

Krylov subspace methods are powerful and nowadays, when combined with a preconditioner, comprise the most widely used class of preconditioned iterative methods. Because they build a basis, in exact arithmetic, convergence is achieved in at most *n* iterations (but in the presence of rounding errors, this is not guaranteed). If *n* is large, it is impractical to perform *O*(*n*) iterations; the hope is that the process returns a sufficiently accurate solution far earlier. Unfortunately, for a given *A*, right-hand side vector *b*, and initial guess *x*<sup>(0)</sup>, it is usually not possible to predict the rate of convergence. If *A* is an SPD matrix, then it can be shown that the approximate solution *x*<sup>(*k*)</sup> at iteration *k* computed using the CG method satisfies

$$\|\mathbf{x} - \mathbf{x}^{(k)}\|\_{A} \le 2\left(\frac{\sqrt{\kappa(A)} - 1}{\sqrt{\kappa(A)} + 1}\right)^k \|\mathbf{x} - \mathbf{x}^{(0)}\|\_{A},$$

where ‖·‖<sub>*A*</sub> is the *A*-norm, and *κ*(*A*) is the spectral condition number given by (7.15). Clearly, there is good (fast) convergence when *κ*(*A*) is small, but poor (slow) convergence usually occurs if *κ*(*A*) ≫ 1. But this error bound can be highly pessimistic. It does not show the potential for CG to converge superlinearly or that the rate of convergence depends on the distribution of all the eigenvalues of *A*. In practice, it is not normally possible to obtain detailed spectral information. Thus, even for CG, preconditioning is often based on experimentation. For non-SPD matrices, less is known and methods that guarantee the monotonic reduction of a relevant quantity at each iteration are sometimes favoured. For example, if the minimal residual (MINRES) method is used for solving symmetric indefinite systems, then in exact arithmetic, the norm of the residual is monotonically decreasing. However, no general descriptive convergence theory is available for Krylov subspace methods for nonsymmetric systems (including GMRES). This is a significant problem because, without theory to guide us, preconditioning must be heuristic.
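The CG error bound can be compared with the actual error on a small SPD problem. The following sketch uses a plain, unpreconditioned textbook form of CG written with dense NumPy arrays; the test matrix (a 50 × 50 tridiagonal matrix with 2 on the diagonal and −1 on the off-diagonals), the chosen solution vector and the function names are assumptions for illustration.

```python
import numpy as np


def conjugate_gradient(A, b, x0, num_iters):
    """Unpreconditioned CG; returns the iterates x^(0), ..., x^(num_iters)."""
    x = np.array(x0, dtype=float)
    r = b - A @ x
    p = r.copy()
    iterates = [x.copy()]
    for _ in range(num_iters):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)          # step length
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)    # makes the new direction A-conjugate
        p = r_new + beta * p
        r = r_new
        iterates.append(x.copy())
    return iterates


def a_norm(A, v):
    """A-norm of v for SPD A."""
    return float(np.sqrt(v @ (A @ v)))


n = 50
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
x_true = np.sin(np.arange(1, n + 1))
b = A @ x_true

iterates = conjugate_gradient(A, b, np.zeros(n), 15)
errors = [a_norm(A, x_true - xk) for xk in iterates]
sqrt_kappa = np.sqrt(np.linalg.cond(A))
factor = (sqrt_kappa - 1.0) / (sqrt_kappa + 1.0)
bounds = [2.0 * factor ** k * errors[0] for k in range(len(errors))]
for k in (0, 5, 10, 15):
    print(k, errors[k], bounds[k])
```

On examples of this kind the computed errors typically sit well below the bound, which is consistent with the remark that the bound can be pessimistic.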

#### **9.2 Introduction to Algebraic Preconditioners**

Preconditioning corresponds to the application of a matrix (or a linear operator) to the original linear system to yield a different linear system that has more favourable properties. Consider the preconditioned linear system

$$M^{-1}Ax = M^{-1}b.\tag{9.7}$$

Here $M^{-1}$ is applied to $A$ from the left. We say that $A$ is preconditioned from the left and $M$ is a left preconditioner. Analogously, the linear system can be preconditioned from the right

$$A M^{-1} y = b, \qquad x = M^{-1} y. \tag{9.8}$$

The following result states that it is not possible to determine a priori which variant is the best.

**Theorem 9.2 (Mendelsohn 1956)** *Let $\delta$ and $\Delta$ be positive numbers. Then, for any $n \ge 3$, there exist nonsingular $n \times n$ matrices $A$ and $M$ such that all the entries of $M^{-1}A - I$ have absolute value less than $\delta$ and all the entries of $AM^{-1} - I$ have absolute values greater than $\Delta$.*

Nevertheless, the choice between left and right preconditioning is still important and may be based on the properties of the coupling of the preconditioner with the iterative method or on the distribution of the eigenvalues of *A*. The computed quantities that are readily available during a preconditioned iterative method depend on how the preconditioner is applied and this may influence the choice. These quantities may be used, for example, to decide when to terminate the iterations. An obvious advantage of right preconditioning is that in exact arithmetic, the residuals for the right preconditioned system are identical to the true residuals, enabling convergence to be monitored accurately. In some cases, the numerical properties of an implementation and/or the computer architecture may also play a part.

For $M$ in factorized form $M = M_1 M_2$, **two-sided** (or **split**) preconditioning is an option. The iterative method then solves the transformed system

$$M\_1^{-1} A M\_2^{-1} \mathbf{y} = M\_1^{-1} \, \boldsymbol{b}, \qquad \mathbf{x} = M\_2^{-1} \, \mathbf{y}. \tag{9.9}$$

If $A$ and $M$ are SPD matrices, then $M_2 = M_1^T$ and we would like the preconditioned matrix $M_1^{-1} A M_1^{-T}$ to be SPD. However, it is not necessary to use a two-sided transformation with the preconditioned conjugate gradient (PCG) method because it can be formulated using the $M$-inner product in which the matrix $M^{-1}A$ is self-adjoint.

**Theorem 9.3 (Saad 2003b; van der Vorst 2003)** *Let $A$ and $M$ be SPD matrices. Then $M^{-1}A$ is self-adjoint in the $M$-inner product.*

*Proof* Self-adjointness is implied by the following chain of equalities.

$$\begin{aligned} \langle M^{-1}Ax, y \rangle_{M} &= \langle Ax, y \rangle = \langle x, Ay \rangle = \langle x, MM^{-1}Ay \rangle \\ &= \langle Mx, M^{-1}Ay \rangle = \langle x, M^{-1}Ay \rangle_{M}. \end{aligned}$$

Left preconditioned CG with the $M$-inner product is mathematically equivalent to right preconditioned CG with the $M^{-1}$-inner product. If $A$ is symmetric but not positive definite, the PCG method can formally be written down, but the necessary conditions for convergence may not be satisfied and the method may break down (division by a zero quantity).
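Theorem 9.3 is easy to check numerically. The sketch below uses randomly generated SPD matrices (an assumption made purely for illustration) and forms $M^{-1}A$ densely, which would not be done in practice. It confirms that the two $M$-inner products agree even though $M^{-1}A$ itself is not a symmetric matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6


def random_spd(n, rng):
    """Return a random SPD matrix (X X^T + n I is SPD by construction)."""
    X = rng.standard_normal((n, n))
    return X @ X.T + n * np.eye(n)


A = random_spd(n, rng)
M = random_spd(n, rng)
x = rng.standard_normal(n)
y = rng.standard_normal(n)


def inner_M(u, v):
    """M-inner product <u, v>_M = v^T M u."""
    return v @ (M @ u)


MinvA = np.linalg.solve(M, A)        # M^{-1} A, formed densely for the check only
lhs = inner_M(MinvA @ x, y)          # <M^{-1} A x, y>_M
rhs = inner_M(x, MinvA @ y)          # <x, M^{-1} A y>_M
print(abs(lhs - rhs))                # agreement up to rounding
print(np.allclose(MinvA, MinvA.T))   # M^{-1} A is generally not symmetric
```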

#### *9.2.1 Desirable Preconditioner Properties*

An obvious objective is for the preconditioner to lead to rapid convergence. As already noted, if the matrix *A* is SPD, then the convergence rate of the CG method depends on the distribution of its eigenvalues. The preconditioner should aim to reduce the condition number, but this is not necessarily sufficient to give fast convergence. For general matrices, despite the lack of theoretical guarantees regarding convergence, many useful preconditioners have nevertheless been motivated by bounding the condition number of the preconditioned matrix.

Choosing a preconditioner is often based on how costly it is to compute and on some indicators that potentially reflect its quality. In particular, the **accuracy** of a preconditioner *M* can be assessed using the norm of the error matrix

$$\|E\| = \|M - A\|,$$

and its **stability** can be measured using

$$\|M^{-1}E\| = \|I - M^{-1}A\| \quad \text{or} \quad \|EM^{-1}\| = \|I - AM^{-1}\|.$$

If a preconditioner is used to solve a large number of systems over which the cost of constructing it can be amortized, then the expense of constructing $M$ in terms of time may not be the driving factor. However, as the preconditioner must be applied at each iteration of the solver, unless very few iterations are performed, it is essential that each application is inexpensive. Each application $M^{-1}w$ involves solving a linear system $Mv = w$. If $M$ is in factorized form and the factors are (block) triangular, this is straightforward, but because triangular solves are inherently serial and hard to parallelize, repeated substitutions can be a critical computational bottleneck. In some cases, rather than $M$, the inverse $M^{-1}$ is computed directly. In this case, we have an **approximate inverse preconditioner**. Applying such a preconditioner involves only matrix-vector multiplications, which are normally easier to parallelize. However, because the inverse of an irreducible matrix is dense (Theorem 7.3), it is important that $M^{-1}$ is constructed to be sparse. Such preconditioners are discussed in Chapter 11.

#### *9.2.2 Simple Algebraic Preconditioners*

The simplest preconditioner consists of the diagonal of the matrix, $M = D_A$. This is known as the (point) Jacobi preconditioner. Block versions can be derived by partitioning $\mathcal{V} = \{1, 2, \ldots, n\}$ into mutually disjoint subsets $\mathcal{V}_1, \ldots, \mathcal{V}_l$ and then setting

$$m_{ij} = \begin{cases} a_{ij} & \text{if } i \text{ and } j \text{ belong to the same subset } \mathcal{V}_k \text{ for some } k, \ 1 \le k \le l, \\ 0 & \text{otherwise.} \end{cases}$$

Often, natural choices for the partitioning suggest themselves. For example, supervariables can be used or the partitioning may be chosen to coincide with the division of variables over the processors in a parallel environment. Jacobi preconditioners need very little storage and are easy to implement.
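A minimal dense sketch of the point and block Jacobi preconditioners follows. The helper names, the small test matrix, and the two-block partition are hypothetical choices for illustration, and indices are 0-based. Because the blocks are mutually disjoint, the solve $Mv = w$ decouples into independent small systems.

```python
import numpy as np


def block_jacobi(A, subsets):
    """Return M with m_ij = a_ij if i and j lie in the same subset, else 0."""
    M = np.zeros_like(A)
    for idx in subsets:
        idx = np.asarray(idx)
        M[np.ix_(idx, idx)] = A[np.ix_(idx, idx)]
    return M


def apply_block_jacobi(A, subsets, w):
    """Solve M v = w block by block (the blocks are independent)."""
    v = np.zeros_like(w, dtype=float)
    for idx in subsets:
        idx = np.asarray(idx)
        v[idx] = np.linalg.solve(A[np.ix_(idx, idx)], w[idx])
    return v


A = np.array([[4.0, -1.0, 0.0, 0.0],
              [-1.0, 4.0, -1.0, 0.0],
              [0.0, -1.0, 4.0, -1.0],
              [0.0, 0.0, -1.0, 4.0]])
point = [[i] for i in range(4)]   # point Jacobi: M = D_A
blocks = [[0, 1], [2, 3]]         # a hypothetical two-block partition
w = np.ones(4)
print(apply_block_jacobi(A, point, w))
print(apply_block_jacobi(A, blocks, w))
```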

The SSOR preconditioner, like the Jacobi preconditioner, can be derived from *A* without any work. If *A* is symmetric, then using the notation (9.6), the SSOR preconditioner is defined to be

$$M = (D\_A + L\_A) D\_A^{-1} (D\_A + L\_A)^T,\tag{9.10}$$

or, using a parameter $0 < \omega < 2$, as

$$M = \frac{1}{2 - \omega} (\frac{1}{\omega} D\_A + L\_A) (\frac{1}{\omega} D\_A)^{-1} (\frac{1}{\omega} D\_A + L\_A)^T.$$

The optimal value of *ω* will reduce the number of iterations needed for convergence of the iterative solver, but it is usually prohibitively expensive to compute the spectral information needed to calculate it. Again, block variants are possible.
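Applying the SSOR preconditioner never requires $M$ to be formed: solving $Mv = w$ reduces to a forward substitution, a diagonal scaling, and a backward substitution. The sketch below is a dense illustration under the assumption that $A$ is symmetric with nonzero diagonal; the function names are hypothetical, and $\omega = 1$ corresponds to (9.10).

```python
import numpy as np


def forward_substitution(T, b):
    """Solve T y = b for lower triangular T."""
    y = np.zeros(len(b))
    for i in range(len(b)):
        y[i] = (b[i] - T[i, :i] @ y[:i]) / T[i, i]
    return y


def backward_substitution(U, b):
    """Solve U y = b for upper triangular U."""
    n = len(b)
    y = np.zeros(n)
    for i in range(n - 1, -1, -1):
        y[i] = (b[i] - U[i, i + 1:] @ y[i + 1:]) / U[i, i]
    return y


def apply_ssor(A, w, omega=1.0):
    """Return v with M v = w, where M is the SSOR preconditioner of symmetric A."""
    d = np.diag(A) / omega                 # diagonal of (1/omega) D_A
    T = np.diag(d) + np.tril(A, k=-1)      # (1/omega) D_A + L_A
    u = forward_substitution(T, w)
    u = d * u
    v = backward_substitution(T.T, u)
    return (2.0 - omega) * v


def ssor_matrix(A, omega=1.0):
    """Explicit M, formed here only to check apply_ssor."""
    T = np.diag(np.diag(A) / omega) + np.tril(A, k=-1)
    return T @ np.diag(omega / np.diag(A)) @ T.T / (2.0 - omega)


n = 5
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
w = np.arange(1.0, n + 1)
print(np.linalg.norm(ssor_matrix(A, 1.3) @ apply_ssor(A, w, 1.3) - w))
```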

#### *9.2.3 The Eisenstat Trick*

Within a preconditioned iterative solver, it is generally cheaper to apply $M^{-1}$ and $A$ separately, rather than explicitly forming and storing the preconditioned matrix. However, in special cases, it is possible to improve efficiency by combining the action of the preconditioner with the matrix-vector multiplication. One such approach is called the **Eisenstat trick**. Consider the matrix splitting (9.6), and let $M$ be given by

$$M = (D + L\_A) \left[ D^{-1} (D + U\_A) \right] = M\_1 \, M\_2,\tag{9.11}$$

where $D$ is a nonsingular diagonal matrix. The SSOR matrix (9.10) is one example in the symmetric case (with $D = D_A$), but more generally $D \ne D_A$. Using two-sided preconditioning, (9.9) becomes

$$A' \mathbf{y} = M\_1^{-1} A M\_2^{-1} \mathbf{y} = (D + L\_A)^{-1} A [D^{-1} (D + U\_A)]^{-1} \mathbf{y} = (D + L\_A)^{-1} b. \tag{9.12}$$

Setting

$$
\bar{L} = D^{-1} L\_A, \quad \bar{U} = D^{-1} U\_A, \quad \bar{A} = D^{-1} A, \quad \text{and } \bar{b} = (I + \bar{L})^{-1} D^{-1} b,
$$

and using (9.6), we obtain

$$A' = (D + L\_A)^{-1} A [D^{-1} (D + U\_A)]^{-1} = [(D + L\_A)^{-1} D] D^{-1} A [D^{-1} (D + U\_A)]^{-1}$$

$$= [D^{-1} (D + L\_A)]^{-1} D^{-1} A (I + D^{-1} U\_A)^{-1} = (I + \bar{L})^{-1} \bar{A} (I + \bar{U})^{-1}.$$

That is, the system in (9.12) becomes

$$A'\mathbf{y} = (I + \bar{L})^{-1}\bar{A}(I + \bar{U})^{-1}\mathbf{y} = (I + \bar{L})^{-1}D^{-1}b = \bar{b}.\tag{9.13}$$

If $y$ solves (9.13), then the solution $x$ of $(I + \bar{U})x = y$ solves $Ax = b$. But the expression for $A'$ can be further transformed as

$$A' = (I + \bar{L})^{-1} \left( I + \bar{L} + D^{-1} D\_A - 2I + I + \bar{U} \right) (I + \bar{U})^{-1}$$

$$= (I + \bar{L})^{-1} \left[ (I + \bar{L})(I + \bar{U})^{-1} + (D^{-1} D\_A - 2I)(I + \bar{U})^{-1} + I \right]$$

$$= (I + \bar{U})^{-1} + (I + \bar{L})^{-1} \left[ (D^{-1} D\_A - 2I)(I + \bar{U})^{-1} + I \right].$$

Thus, to compute $z = A'w = (I + \bar{L})^{-1}\bar{A}(I + \bar{U})^{-1}w$ for a given $w$, it is necessary only to solve two triangular systems

$$(I + \bar{U})\, z_1 = w \quad \text{followed by} \quad (I + \bar{L})\, z_2 = (D^{-1} D_A - 2I)\, z_1 + w$$

and then set $z = z_1 + z_2$. Note that this trick is not a preconditioner: it is simply a way of applying the preconditioner (9.11).
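The two triangular solves above can be checked against the explicitly formed operator. In this sketch, `np.linalg.solve` stands in for dedicated triangular solves, the function names are hypothetical, and `d` holds the diagonal entries of $D$ (which need not equal those of $D_A$).

```python
import numpy as np


def eisenstat_matvec(A, d, w):
    """z = (I + Lbar)^{-1} Abar (I + Ubar)^{-1} w using two triangular solves."""
    n = A.shape[0]
    Lbar = np.tril(A, -1) / d[:, None]            # D^{-1} L_A (row scaling)
    Ubar = np.triu(A, 1) / d[:, None]             # D^{-1} U_A
    t = np.diag(A) / d - 2.0                      # diagonal of D^{-1} D_A - 2I
    I = np.eye(n)
    z1 = np.linalg.solve(I + Ubar, w)             # upper triangular system
    z2 = np.linalg.solve(I + Lbar, t * z1 + w)    # lower triangular system
    return z1 + z2


def eisenstat_dense(A, d):
    """Explicit (I + Lbar)^{-1} Abar (I + Ubar)^{-1}, formed for checking only."""
    n = A.shape[0]
    Lbar = np.tril(A, -1) / d[:, None]
    Ubar = np.triu(A, 1) / d[:, None]
    Abar = A / d[:, None]
    return np.linalg.inv(np.eye(n) + Lbar) @ Abar @ np.linalg.inv(np.eye(n) + Ubar)


rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n)) + n * np.eye(n)
w = rng.standard_normal(n)
d = 1.5 * np.diag(A)                              # an example with D different from D_A
print(np.linalg.norm(eisenstat_matvec(A, d, w) - eisenstat_dense(A, d) @ w))
```

No multiplication with $A$ appears in `eisenstat_matvec`; only the strictly triangular parts and the diagonal of $A$ are used, which is the source of the saving.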

### **9.3 Some Special Classes of Matrices**


The development of algebraic preconditioners has historically been closely connected to their earliest application, which was solving linear systems arising from the discretization of partial differential equations. Consider a two-dimensional Poisson problem discretized on a given domain by a uniform regular grid using finite differences, with zero Dirichlet conditions on the boundary. The resulting matrix for a 3 × 3 rectangular grid using the natural ordering of the vertices is given by

$$A = \begin{pmatrix}
4 & -1 & & -1 & & & & & \\
-1 & 4 & -1 & & -1 & & & & \\
 & -1 & 4 & & & -1 & & & \\
-1 & & & 4 & -1 & & -1 & & \\
 & -1 & & -1 & 4 & -1 & & -1 & \\
 & & -1 & & -1 & 4 & & & -1 \\
 & & & -1 & & & 4 & -1 & \\
 & & & & -1 & & -1 & 4 & -1 \\
 & & & & & -1 & & -1 & 4
\end{pmatrix}. \tag{9.14}$$

If the spatial discretization on the domain is characterized by the mesh parameter $h$, then the size of $A$ is inversely proportional to $h$. Expressing some matrix-related quantities asymptotically as functions of $h$ can be useful if the discretized domain is bounded. For example, the condition number of the matrix (9.14) grows asymptotically like $h^{-2}$. Matrices with similar banded sparsity patterns with nonzeros on only a small number of subdiagonals arise from simple finite difference or finite element discretizations of other partial differential equations. They can be considered as particular cases of more general special classes of matrices whose properties can be derived using the theoretical background behind the discretizations.
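The matrix (9.14) and its larger relatives can be generated with Kronecker products, which makes the $h^{-2}$ growth of the condition number easy to observe. This is a dense sketch for small grids only; the function name `poisson2d` and the unit-square mesh width $h = 1/(m+1)$ are assumptions made for illustration.

```python
import numpy as np


def poisson2d(m):
    """5-point finite-difference Laplacian on an m x m grid, natural ordering."""
    T = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    I = np.eye(m)
    return np.kron(I, T) + np.kron(T, I)


A = poisson2d(3)                     # reproduces the 9 x 9 matrix (9.14)

for m in (4, 8, 16):
    h = 1.0 / (m + 1)
    kappa = np.linalg.cond(poisson2d(m))
    print(m, kappa, kappa * h**2)    # the last column levels off as h decreases
```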

M-matrices are one such class. Let the off-diagonal entries of the nonsingular matrix $A$ be nonpositive (that is, $a_{ij} \le 0$ for all $i \ne j$). Then $A$ is a (nonsingular) **M-matrix** if one of the following holds:


The matrix (9.14) is an example of an M-matrix. A symmetric M-matrix is known as a Stieltjes matrix, and such a matrix is positive definite.

The class of **nonsingular H-matrices** includes matrices coming from simple discretizations of convection–diffusion problems. The **comparison** matrix *C(A)* of *A* is defined to have entries 

$$C(A)_{ij} = \begin{cases} -|a_{ij}|, & i \neq j, \\ |a_{ij}|, & i = j. \end{cases}$$

If *C(A)* is a nonsingular M-matrix, then *A* is a nonsingular H-matrix. 

We also recall diagonally dominant matrices. *A* is **diagonally dominant by rows** if

$$\sum_{j=1,\ j\neq i}^{n} |a_{ij}| \le |a_{ii}|, \quad 1 \le i \le n. \tag{9.15}$$

$A$ is **strictly diagonally dominant by rows** if strict inequality holds in (9.15) for all $i$. $A$ is (strictly) diagonally dominant by columns if $A^T$ is (strictly) diagonally dominant by rows. $A$ is said to be **irreducibly diagonally dominant** if it is irreducible and (9.15) is satisfied with strict inequality for at least one row $i$. If $A$ is strictly diagonally dominant by rows or columns or is irreducibly diagonally dominant, then it is nonsingular and factorizable. The class of diagonally dominant matrices is closely connected to that of nonsingular H-matrices. For example, the property that there exists a diagonal matrix $D$ with positive entries such that $AD$ is strictly diagonally dominant is equivalent to $A$ being a nonsingular H-matrix.
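For small dense matrices, these classes can be tested directly. The sketch below checks (9.15), builds the comparison matrix, and tests for a nonsingular M-matrix using the standard characterization that a nonsingular matrix with nonpositive off-diagonal entries is an M-matrix exactly when its inverse is entrywise nonnegative. Choosing that particular criterion, the tolerance, and the function names are assumptions of this illustration, and forming the inverse is only sensible for tiny examples.

```python
import numpy as np


def is_diagonally_dominant_by_rows(A, strict=False):
    """Check (9.15), optionally with strict inequality in every row."""
    absA = np.abs(A)
    diag = np.diag(absA)
    off = absA.sum(axis=1) - diag
    return bool(np.all(off < diag) if strict else np.all(off <= diag))


def comparison_matrix(A):
    """C(A): -|a_ij| off the diagonal and |a_ii| on it."""
    C = -np.abs(A)
    np.fill_diagonal(C, np.abs(np.diag(A)))
    return C


def is_nonsingular_m_matrix(A, tol=1e-12):
    """Dense check: nonpositive off-diagonal entries and A^{-1} >= 0 entrywise."""
    off = A - np.diag(np.diag(A))
    if np.any(off > 0):
        return False
    try:
        Ainv = np.linalg.inv(A)
    except np.linalg.LinAlgError:
        return False
    return bool(np.all(Ainv >= -tol))


def is_nonsingular_h_matrix(A):
    """A is a nonsingular H-matrix if C(A) is a nonsingular M-matrix."""
    return is_nonsingular_m_matrix(comparison_matrix(A))


T = 2.0 * np.eye(3) - np.eye(3, k=1) - np.eye(3, k=-1)
A = np.kron(np.eye(3), T) + np.kron(T, np.eye(3))    # the matrix (9.14)
B = A.copy()
B[0, 1] = B[1, 0] = 1.0      # flip one sign: not an M-matrix, but C(B) = A

print(is_diagonally_dominant_by_rows(A), is_nonsingular_m_matrix(A))
print(is_nonsingular_m_matrix(B), is_nonsingular_h_matrix(B))
```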

#### **9.4 Introduction to Incomplete Factorizations**

Preconditioners based on an incomplete factorization of *A* in which entries are dropped during the factorization are widely used in computational science and engineering, especially when the underlying physics of a problem is difficult to exploit. Besides being used as standalone preconditioners, incomplete factorizations are important within more sophisticated methods. For example, they can be used to precondition subdomain solves in domain decomposition schemes or as a smoother in multigrid methods. Incomplete factorizations fall into three main classes:

(i) Threshold-based methods in which the locations of permissible fill-in are determined in conjunction with the numerical factorization of *A*; entries of the computed factors of absolute value less than a prescribed threshold *τ >* 0 are dropped. Success relies on determining a suitable *τ* . This is highly problem dependent and is influenced by the scaling of *A*.


**Figure 9.1** Illustration of matrix sparsification. *f* denotes filled entries in the factors. On the left is the original matrix *A* with its filled entries, in the centre is the permuted matrix with its filled entries, and on the right is the sparsified permuted matrix after dropping the entries of *A* in positions *(*1*,* 3*)* and *(*3*,* 1*)* (it has no filled entries).


The basic dropping approaches can be combined and they can be employed in conjunction with discarding entries in *A* before the factorization commences. This initial sparsification is appealing because it may be possible to obtain an incomplete factorization by computing a complete factorization of the sparsified matrix. Sparsification can be performed by value or by position. Figure 9.1 illustrates sparsification of *A* after permuting it reveals a block structure (the permutation can be found using, for example, Algorithm 3.7 or 3.8).

### *9.4.1 Incomplete Factorization Breakdown*

Dropping entries can lead to **breakdown** of the incomplete factorization, that is, a zero pivot may be encountered during the factorization (or a non-positive pivot in the Cholesky case). It is only possible to predict when this will happen in special cases, as stated in the following theorem, which is a consequence of the fact that being an M-matrix or an H-matrix is preserved in the sequence of the Schur complements during the factorization. This result does not hold for general SPD matrices.

**Theorem 9.4 (Meijerink & van der Vorst 1977; Manteuffel 1980; Varga et al. 1980)** *Let A be a nonsingular M-matrix or H-matrix. If the target sparsity pattern of the incomplete factors contains the positions of the diagonal entries, then the incomplete factorization of A does not break down.*

To illustrate the error accumulation in the incomplete factorization of an M-matrix using dropping, consider the example given in (9.14). Let $E$ be the error matrix. $E$ is initialized to zero, and at each stage of the factorization, the dropped entries are added into it. After one step of the complete factorization of $A$, the partially eliminated matrix $A^{(2)}$ is

$$A^{(2)} = \begin{pmatrix}
4 & -1 & & -1 & & & & & \\
 & 3.75 & -1 & -0.25 & -1 & & & & \\
 & -1 & 4 & & & -1 & & & \\
 & -0.25 & & 3.75 & -1 & & -1 & & \\
 & -1 & & -1 & 4 & -1 & & -1 & \\
 & & -1 & & -1 & 4 & & & -1 \\
 & & & -1 & & & 4 & -1 & \\
 & & & & -1 & & -1 & 4 & -1 \\
 & & & & & -1 & & -1 & 4
\end{pmatrix}.$$

Suppose the filled entries −0*.*25 in positions *(*2*,* 4*)* and *(*4*,* 2*)* are dropped. Then the values of the corresponding diagonal entries in the subsequent elimination matrices are larger than they would have been without any dropping. Furthermore, as all the off-diagonal nonzero entries are negative, for any target sparsity pattern the dropped entries are negative. The M-matrix property applies to all subsequent Schur complements, which implies that all the entries added into *E* are negative and so the absolute values of the entries in *E* grow as the factorization proceeds (the contributions can never cancel each other out). Thus, although the factorization does not break down, the growth in the error is potentially a problem for the accuracy of an incomplete factorization of an M-matrix.

#### *9.4.2 Perturbing Entries to Prevent Breakdown*

Modifying the diagonal entries of $A$ is a common approach to avoid breakdown in an incomplete factorization. Breakdown is illustrated in Figure 9.2. A simple a posteriori remedy is to perturb the diagonal value that has caused breakdown. In this example, $a_{44}$ is increased so that $\tilde{d}_{44}$ has a (small) positive value. Unfortunately, practical experience of making simple ad hoc modifications is generally not very positive. This is because making a local perturbation when breakdown occurs (or is close to occurring) may be too late for the resulting factorization to be good enough to be useful as a preconditioner (growth may already have happened in some of the factor entries). This applies to standard incomplete factorizations and to approximate inverses.

$$A = \begin{pmatrix} 3 & -2 & & 2 \\ -2 & 3 & -2 & \\ & -2 & 3 & -2 \\ 2 & & -2 & 8 \end{pmatrix}, \quad L = \begin{pmatrix} 1 & & & \\ -2/3 & 1 & & \\ & -6/5 & 1 & \\ 2/3 & 4/5 & -2/3 & 1 \end{pmatrix}, \quad D = \begin{pmatrix} 3 & & & \\ & 5/3 & & \\ & & 3/5 & \\ & & & 16/3 \end{pmatrix}.$$

$$\widetilde{L} = \begin{pmatrix} 1 & & & \\ -2/3 & 1 & & \\ & -6/5 & 1 & \\ 2/3 & & -10/3 & 1 \end{pmatrix}, \quad \widetilde{D} = \begin{pmatrix} 3 & & & \\ & 5/3 & & \\ & & 3/5 & \\ & & & 0 \end{pmatrix}.$$

**Figure 9.2** An example to illustrate breakdown. The matrix $A$ and its square root-free factors are given together with the incomplete factors $\widetilde{L}$ and $\widetilde{D}$ that result from dropping the entry $l_{42}$ during the factorization. $\tilde{d}_{44} = 0$ means the incomplete factorization has broken down.

An alternative and more effective strategy to avoid breakdown is to modify all the diagonal entries of $A$ a priori and then compute an incomplete factorization of $A + \alpha I$, where the shift $\alpha > 0$ is a scalar parameter. It is always possible to find $\alpha$ such that $A + \alpha I$ is nonsingular and diagonally dominant and is thus an H-matrix. However, being an H-matrix is not a necessary condition for a matrix to be factorizable and, in practice, much smaller values of $\alpha$ can provide incomplete factorizations for which $\|E\|$ is small. A simple trial-and-error procedure for choosing a shift is given in Algorithm 9.1. The initial shift $\alpha^{(0)} = 0$ is reasonable if $A$ is an SPD matrix or, more generally, has positive diagonal entries. If $\alpha^{(0)} > 0$ and the incomplete factorization of $A + \alpha^{(0)} I$ is successful, then the algorithm can be modified to reduce $\alpha^{(0)}$ (for example, it could be replaced by $\alpha^{(0)}/2$) and then restarted. The potential benefit is a smaller $\|E\|$ (and hopefully a higher quality preconditioner) but at the cost of performing further incomplete factorizations. Observe that $A$ should be prescaled to try to limit the size of $\alpha$.

#### **ALGORITHM 9.1 Trial-and-error global shifted incomplete factorization**

**Input:** Matrix $A$, incomplete factorization algorithm, initial shift $\alpha^{(0)}$

**Output:** Shift $\alpha$ and incomplete factors $\widetilde{L}$ and $\widetilde{U}$ such that $A + \alpha I \approx \widetilde{L}\widetilde{U}$

1: **for** $k = 0, 1, 2, \ldots$ **do**

2: $A + \alpha^{(k)} I \approx \widetilde{L}\widetilde{U}$ ▷ Perform incomplete factorization

3: If successful, $\alpha = \alpha^{(k)}$ and **return**

4: $\alpha^{(k+1)} = 2\alpha^{(k)}$

5: **end for**
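The trial-and-error shifting strategy can be sketched as follows, using a dense incomplete Cholesky factorization restricted to a given sparsity pattern as the incomplete factorization algorithm. The function names, the pivot tolerance, and the parameter `alpha_min` are hypothetical; `alpha_min` is an assumption added so that doubling can leave an initial shift of zero. The test matrix is the one from Figure 9.2, for which the no-fill pattern drops the entry in position $(4, 2)$.

```python
import numpy as np


class Breakdown(Exception):
    """Signals a pivot that is not safely positive."""


def incomplete_cholesky(A, pattern, pivot_tol=1e-12):
    """Incomplete Cholesky factor with entries kept only where pattern is True."""
    n = A.shape[0]
    L = np.zeros((n, n))
    for j in range(n):
        pivot = A[j, j] - L[j, :j] @ L[j, :j]
        if pivot <= pivot_tol:
            raise Breakdown(f"pivot {pivot:.3e} in column {j}")
        L[j, j] = np.sqrt(pivot)
        for i in range(j + 1, n):
            if pattern[i, j]:              # entries outside the pattern are dropped
                L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
    return L


def shifted_incomplete_cholesky(A, pattern, alpha0=0.0, alpha_min=1e-3, max_tries=50):
    """Increase the shift until the incomplete factorization of A + alpha*I succeeds."""
    alpha = alpha0
    I = np.eye(A.shape[0])
    for _ in range(max_tries):
        try:
            return alpha, incomplete_cholesky(A + alpha * I, pattern)
        except Breakdown:
            # Doubling as in the algorithm; alpha_min (an assumption) handles alpha = 0.
            alpha = max(2.0 * alpha, alpha_min)
    raise RuntimeError("no successful shift found within max_tries")


A = np.array([[3.0, -2.0, 0.0, 2.0],
              [-2.0, 3.0, -2.0, 0.0],
              [0.0, -2.0, 3.0, -2.0],
              [2.0, 0.0, -2.0, 8.0]])
pattern = A != 0.0                         # no fill allowed
alpha, L = shifted_incomplete_cholesky(A, pattern)
print(alpha)
```

In a production setting, the factorization would operate on sparse data structures and the matrix would be prescaled first, as the text recommends.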

#### *9.4.3 Pivoting to Prevent Breakdown*

An alternative approach to avoid small pivots is to follow what is done in sparse direct solvers and incorporate partial or threshold pivoting within the incomplete factorization algorithm. This potentially makes the factorization significantly more expensive and much more complicated to implement efficiently. As with sparse direct solvers, preprocessing can limit the need for pivoting. If *A* is nonsymmetric, then row and column permutations can be used to bring large entries onto the diagonal before the factorization commences. In particular, the weighted matching ordering and scaling discussed in Section 7.4.2 can be used. In the symmetric case, symmetry is preserved by choosing pivots from the diagonal. Again, the matrix should be prescaled, and then at each stage, a straightforward choice is to select as the next pivot the diagonal entry of the largest absolute value in the remaining active submatrix. If there is no suitable diagonal entry (for example, if the absolute values of all the remaining diagonal entries are less than some threshold), then either the diagonal can be modified or 2 × 2 pivots that preserve symmetry can be used.

One way to attempt to minimize the norm of the error matrix *E* is to select the pivot candidate to minimize the sum of the absolute values of the dropped (discarded) entries. However, this **minimum discarded fill** ordering is typically too expensive to be useful in practice.

#### **9.5 Factorizations as Preconditioner Components**

Sometimes (incomplete) factorizations are employed as components in the construction of more complex preconditioners. Here some possible approaches are briefly discussed.

#### *9.5.1 Polynomial Preconditioning*

Polynomial preconditioning selects a polynomial *φ* and applies a Krylov subspace method to solve either

$$
\phi(A)Ax = \phi(A)\,b
$$

(left preconditioning) or

$$A\,\phi(A)\,\text{y} = b, \quad \text{x} = \phi(A)\,\text{y}$$

(right preconditioning). $\phi$ should be of small degree and chosen to enhance convergence. Consider the characteristic polynomial $\phi_n(\mu) = \det(\mu I - A)$ of $A$ (det denotes the determinant). The Cayley–Hamilton theorem states that $A$ satisfies its own characteristic equation so that

$$\phi_n(A) = \sum_{j=0}^n \beta_j A^j = 0,$$

where $\beta_j$ ($0 \le j \le n$) are the coefficients of the characteristic polynomial ($\beta_n = 1$, $\beta_0 = (-1)^n \det(A)$). Provided $A$ is nonsingular,

$$A^{-1} = (-1)^{n+1} \frac{1}{\det(A)} \sum\_{j=1}^{n} \beta\_j \ A^{j-1}.$$
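The Cayley–Hamilton expression for $A^{-1}$ can be verified on a small example. The sketch below relies on `np.poly`, which returns the coefficients of $\det(\mu I - A)$ with the highest power first; the test matrix is an arbitrary symmetric choice. This is for illustration only, because computing characteristic polynomial coefficients is not a numerically sound way to invert a matrix.

```python
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
n = A.shape[0]

# Reverse so that beta[j] multiplies mu**j; then beta[n] = 1.
beta = np.poly(A)[::-1]

# A^{-1} = -(1/beta_0) * sum_{j=1}^{n} beta_j A^{j-1}, with beta_0 = (-1)^n det(A)
Ainv = np.zeros_like(A)
power = np.eye(n)                  # holds A^{j-1}
for j in range(1, n + 1):
    Ainv += beta[j] * power
    power = power @ A
Ainv *= -1.0 / beta[0]

print(np.linalg.norm(Ainv @ A - np.eye(n)))
```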

 

A preconditioner can be constructed by taking the first *k* terms, possibly weighted by some suitable scalar coefficients, that is,

$$M^{-1} = \sum_{j=0}^{k} \gamma_j A^j.$$

An important question is why such a preconditioner can help in the presence of the optimality properties of Krylov subspace methods. For example, at iteration $k + 1$ of the CG method, $x^{(k+1)}$ satisfies

$$x^{(k+1)} = x^{(0)} + \phi_k(A)\, r^{(0)}, \quad k = 0, 1, \ldots,$$

where $\phi_k$ is a polynomial of degree $k$. This polynomial is optimal in the sense that $x^{(k+1)}$ minimizes

$$\|x - x^{(k+1)}\|_{A}^{2}.\tag{9.16}$$

A preconditioner that is a polynomial in $A$ cannot speed the convergence because the resulting iteration again forms the new $x^{(k+1)}$ as $x^{(0)}$ plus a polynomial in $A$ times $r^{(0)}$, and thus the same or a higher degree polynomial is needed to achieve the same value of (9.16). Consequently, the number of matrix-vector multiplications cannot decrease. Nevertheless, polynomial preconditioning can be useful for a number of reasons.


Even if only a small number of terms are used in the polynomial approximating $A^{-1}$, a crucial issue is determining the coefficients $\gamma_0, \ldots, \gamma_k$. A straightforward way of doing this is based on the Neumann series of a matrix $C$ given by $\sum_{j=0}^{+\infty} C^j$, which is convergent if and only if $\rho(C) < 1$. In this case,

$$(I-C)^{-1} = \sum_{j=0}^{+\infty} C^j. \tag{9.17}$$

Now let $\bar{M}$ be a nonsingular matrix and $\omega > 0$ a scalar such that the matrix $C = I - \omega \bar{M}^{-1}A$ satisfies $\rho(C) < 1$. Using (9.17),

$$A^{-1} = \omega(\omega \bar{M}^{-1} A)^{-1} \bar{M}^{-1} = \omega(I - C)^{-1} \bar{M}^{-1} = \omega \left(\sum_{j=0}^{+\infty} C^j\right) \bar{M}^{-1}.$$

Truncating the summation gives as a possible preconditioner

$$M^{-1} = \omega \left(\sum_{j=0}^{k} C^j\right) \bar{M}^{-1}.$$

Observe that

$$I - M^{-1}A = I - \omega \left(\sum\_{j=0}^{k} C^j\right) \bar{M}^{-1} A = I - \left(\sum\_{j=0}^{k} C^j\right) (I - C) = C^{k+1},$$

which shows the positive effect of increasing $k$. If $A$ and $\bar{M}$ are SPD matrices, then $M$ can be used with the CG method preconditioned from the left because $M^{-1}A$ is self-adjoint in the $\bar{M}$-inner product. Generalizations of the approach weight the powers of $C$ in $M^{-1}$ using additional scalars. The choice of $\bar{M}$ is crucial for the effectiveness of the approach.
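The identity $I - M^{-1}A = C^{k+1}$ can be confirmed numerically. In the dense sketch below, the function name is hypothetical, and the choices $\bar{M} = D_A$ (Jacobi) and $\omega = 1$ are assumptions for illustration; for the diagonally dominant test matrix they give $\rho(C) < 1$.

```python
import numpy as np


def neumann_preconditioner(A, Mbar, omega, k):
    """Dense M^{-1} = omega * (sum_{j=0}^{k} C^j) Mbar^{-1}, C = I - omega Mbar^{-1} A."""
    n = A.shape[0]
    Mbar_inv = np.linalg.inv(Mbar)
    C = np.eye(n) - omega * Mbar_inv @ A
    total = np.eye(n)                  # running sum of powers, starts with C^0
    power = np.eye(n)
    for _ in range(k):
        power = power @ C
        total = total + power
    return omega * total @ Mbar_inv, C


n = 6
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
Mbar = np.diag(np.diag(A))             # hypothetical inner preconditioner
omega = 1.0
for k in range(4):
    Minv, C = neumann_preconditioner(A, Mbar, omega, k)
    err = np.eye(n) - Minv @ A
    print(k, np.linalg.norm(err, 2))   # decreases as k grows
```

In practice, $M^{-1}$ is never formed: each application costs $k$ multiplications with $A$ and $k+1$ solves with $\bar{M}$.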

#### *9.5.2 Schur Complement Approach and Deflation*

Many contemporary preconditioners are constructed hierarchically. A straightforward example is represented by the approximate solution of saddle point problems using the Schur complement approach. Consider the following general saddle point system:

$$Ax = \begin{pmatrix} G & C \\ R & B \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}. \tag{9.18}$$

Assuming *G* is nonsingular, eliminating *x*<sup>1</sup> from the second block row yields the reduced system

$$Sx\_2 = b\_2 - RG^{-1}b\_1,\tag{9.19}$$

where $S = B - RG^{-1}C$ is the Schur complement of $G$ in $A$. Solving (9.19) involves solving a linear system with $G$ and with $S$. One option is to compute an LU factorization of $G$ and then employ a preconditioned iterative method; this is outlined in Algorithm 9.2. Combining direct and iterative techniques is sometimes referred to as a **hybrid** approach.

#### **ALGORITHM 9.2 Simple Schur complement approach for saddle point systems**

**Input:** Nonsingular saddle point system (9.18) with $G$ nonsingular.

**Output:** Computed solution $x$.
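A minimal sketch of a hybrid Schur complement solve follows. It is one possible realization under stated assumptions, not a transcription of the algorithm: the test matrix is SPD so that plain (unpreconditioned) CG can be used on the reduced system, `np.linalg.solve` stands in for reusing a factorization of $G$, and $S$ is applied matrix-free. All function names are hypothetical.

```python
import numpy as np


def cg(apply_S, rhs, tol=1e-10, maxit=200):
    """Plain conjugate gradients for an SPD operator supplied as a function."""
    x = np.zeros_like(rhs)
    r = rhs.copy()
    p = r.copy()
    rr = r @ r
    for _ in range(maxit):
        if np.sqrt(rr) <= tol * np.linalg.norm(rhs):
            break
        Sp = apply_S(p)
        alpha = rr / (p @ Sp)
        x += alpha * p
        r -= alpha * Sp
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x


def schur_solve(G, C, R, B, b1, b2):
    """Solve the 2 x 2 block system via the reduced system S x2 = b2 - R G^{-1} b1."""
    solve_G = lambda v: np.linalg.solve(G, v)          # stand-in for a factorization of G
    apply_S = lambda v: B @ v - R @ solve_G(C @ v)     # S v without forming S
    x2 = cg(apply_S, b2 - R @ solve_G(b1))
    x1 = solve_G(b1 - C @ x2)                          # back-substitute in the first block row
    return np.concatenate([x1, x2])


rng = np.random.default_rng(3)
X = rng.standard_normal((8, 8))
A = X @ X.T + 8.0 * np.eye(8)          # SPD test matrix, so G and S are SPD
G, C, R, B = A[:5, :5], A[:5, 5:], A[5:, :5], A[5:, 5:]
b = rng.standard_normal(8)
x = schur_solve(G, C, R, B, b[:5], b[5:])
print(np.linalg.norm(A @ x - b))
```

For a general (nonsymmetric or indefinite) saddle point system, CG would have to be replaced by a Krylov method suited to the properties of $S$.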

The Schur complement (or substructuring) approach can be extended to matrices that are split into more blocks. Blocks may arise naturally from the underlying application, but they can also be defined using purely algebraic rules. For example, consider an SPD matrix $A$. Applying graph partitioning techniques (such as the nested dissection approach of Section 8.4) to the adjacency graph $\mathcal{G}(A)$, $A$ can be symmetrically permuted to the doubly bordered block diagonal (DBBD) form

$$P^T A P = A_{DB} = \begin{pmatrix} G_D & R^T \\ R & B \end{pmatrix},$$

where $G_D$ is an SPD block diagonal matrix (Section 8.5.1). $A_{DB}$ is a special case of a symmetric saddle point matrix. A block $LDL^T$ factorization of $A_{DB}$ is given by

$$A\_{DB} = \begin{pmatrix} I & \\ RG\_D^{-1} & I \end{pmatrix} \begin{pmatrix} G\_D & \\ & S \end{pmatrix} \begin{pmatrix} I & G\_D^{-1}R^T \\ & I \end{pmatrix},$$

where the matrix $S$ is the SPD Schur complement. The blocks within $G_D$ can be factorized in parallel using a sparse Cholesky solver. However, $S$ is typically large and significantly denser than $B$ and, in large-scale practical applications, it may not be possible to explicitly assemble and factorize it; in this case, a preconditioned iterative method is needed. If $\widetilde{S}^{-1} \approx S^{-1}$, then an approximate block factorization of $A_{DB}^{-1}$ is

$$M^{-1} = \begin{pmatrix} I & -G_D^{-1}R^T \\ & I \end{pmatrix} \begin{pmatrix} G_D^{-1} & \\ & \widetilde{S}^{-1} \end{pmatrix} \begin{pmatrix} I & \\ -R\, G_D^{-1} & I \end{pmatrix}.$$

Employing $M^{-1}$ as a preconditioner for $A_{DB}$ gives the preconditioned matrix

$$M^{-1}A_{DB} = \begin{pmatrix} I & G_D^{-1} R^T (I - \widetilde{S}^{-1} S) \\ & \widetilde{S}^{-1} S \end{pmatrix}.$$

Applying $M^{-1}$ requires the efficient solution of linear systems with $\widetilde{S}^{-1}S$ and $G_D$. As in other preconditioning approaches, bounding the condition number of the preconditioned matrix may be a useful indicator of the expected convergence of CG. The eigenvalues of $M^{-1}A_{DB}$ are those of $\widetilde{S}^{-1}S$ and unity. Note that the spectrum of $M^{-1}A_{DB}$ is the same as the spectrum of $M^{-1/2}A_{DB}M^{-1/2}$. Thus, $\kappa(M^{-1/2}A_{DB}M^{-1/2})$ depends on the extremal eigenvalues of $\widetilde{S}^{-1}S$. A one-level preconditioner for $S$ is obtained by setting

$$
\widetilde{S}\_1^{-1} = B^{-1}.
$$

Let the matrix $B$ be $m \times m$ and let $\lambda_1 \ge \cdots \ge \lambda_m > 0$ be the eigenvalues of the generalized eigenvalue problem

$$Sz = \lambda \widetilde{S}_1 z.$$

Because $\widetilde{S}_1^{-1}S = I - B^{-1}RG_D^{-1}R^T$, it follows that $\lambda_1 \le 1$ and so

$$\kappa(\widetilde{S}_1^{-1} S) = \kappa(\widetilde{S}_1^{-1/2} S\, \widetilde{S}_1^{-1/2}) = \frac{\lambda_1}{\lambda_m} \le \frac{1}{\lambda_m},$$

which is unbounded as *λm* approaches zero. In general, one-level algebraic preconditioners successfully bound the largest eigenvalues of the preconditioned matrix but encounter difficulties in controlling the smallest ones, which can lie close to the origin, hindering convergence. Strategies that involve a second-level component aim to overcome this and include **deflation** preconditioners and **domain decomposition** preconditioners.
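The spectral claims for the one-level choice $\widetilde{S}_1 = B$ can be observed on a small synthetic DBBD matrix. In the sketch below, all data are randomly generated assumptions: $G_D$ has two SPD diagonal blocks, $S$ is chosen SPD, and $B$ is then defined so that $S$ is exactly the Schur complement. Inverses are formed densely for checking only.

```python
import numpy as np

rng = np.random.default_rng(1)


def random_spd(n, rng):
    """Random SPD matrix."""
    X = rng.standard_normal((n, n))
    return X @ X.T + n * np.eye(n)


nG, m = 6, 3
GD = np.zeros((nG, nG))
GD[:3, :3] = random_spd(3, rng)          # two SPD diagonal blocks
GD[3:, 3:] = random_spd(3, rng)
R = rng.standard_normal((m, nG))
S = random_spd(m, rng)                   # prescribed SPD Schur complement
GinvRt = np.linalg.solve(GD, R.T)        # G_D^{-1} R^T
B = S + R @ GinvRt                       # so that S = B - R G_D^{-1} R^T
ADB = np.block([[GD, R.T], [R, B]])

# Eigenvalues of S z = lambda B z (descending), i.e. of B^{-1} S.
lam = np.sort(np.linalg.eigvals(np.linalg.solve(B, S)).real)[::-1]
print(lam[0] <= 1.0, lam[0] / lam[-1], 1.0 / lam[-1])

# M^{-1} from the approximate block factorization with S-tilde = B.
U = np.block([[np.eye(nG), GinvRt], [np.zeros((m, nG)), np.eye(m)]])
middle = np.block([[np.linalg.inv(GD), np.zeros((nG, m))],
                   [np.zeros((m, nG)), np.linalg.inv(B)]])
Minv = np.linalg.inv(U) @ middle @ np.linalg.inv(U.T)
ev = np.sort(np.linalg.eigvals(Minv @ ADB).real)
print(ev)                                # the values in lam together with nG ones
```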

The basic idea behind deflation is to "hide" parts of the spectrum from the CG method such that the CG iteration "sees" a system that has a much smaller condition number and hopefully a more favourable eigenvalue distribution than the original matrix. The part of the spectrum that is hidden is determined by the deflation subspace and the improvement in the convergence rate of the deflated CG method depends on the choice of this subspace. The ideal deflation subspace is the invariant subspace spanned by the eigenvectors corresponding to the smallest eigenvalues. There are practical cases showing that convergence of the preconditioned iterative method may profit from this restriction of the spectrum to its "effective" part. To illustrate the approach, let $\Lambda$ be the $k \times k$ diagonal matrix with entries equal to the $k$ smallest eigenvalues and let $Z$ be the $m \times k$ matrix whose columns are the corresponding eigenvectors. A two-level deflation preconditioner is defined to be

$$
\widetilde{S}\_2^{-1} = B^{-1} + Z(\Lambda^{-1} - I)Z^T = \widetilde{S}\_1^{-1} + Z(\Lambda^{-1} - I)Z^T.
$$

In practice, challenges remain because $\Lambda$ and $Z$ are typically not readily available.
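The effect of the two-level deflation preconditioner can be checked numerically on a small dense example. The sketch below is illustrative only: it assumes $B = I$ (so that the one-level preconditioner is the identity), takes the exact eigenpairs from a dense eigensolver (precisely the information that is not available in practice), and uses a made-up spectrum with three tiny eigenvalues. The helper name `deflation_preconditioner` is hypothetical.

```python
import numpy as np

def deflation_preconditioner(S, k):
    """Return I + Z (Lambda^{-1} - I) Z^T built from the k smallest eigenpairs of S (B = I assumed)."""
    eigvals, eigvecs = np.linalg.eigh(S)          # eigenvalues in ascending order
    lam, Z = eigvals[:k], eigvecs[:, :k]
    return np.eye(S.shape[0]) + Z @ np.diag(1.0 / lam - 1.0) @ Z.T

# SPD test matrix with a prescribed spectrum: three tiny eigenvalues and a cluster in [0.5, 1].
rng = np.random.default_rng(0)
m, k = 30, 3
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
spectrum = np.concatenate(([1e-4, 1e-3, 1e-2], np.linspace(0.5, 1.0, m - k)))
S = (Q * spectrum) @ Q.T
S = 0.5 * (S + S.T)                               # remove rounding asymmetry

P = deflation_preconditioner(S, k)
PS = P @ S                                        # P and S share eigenvectors, so PS is symmetric
eig_deflated = np.linalg.eigvalsh(0.5 * (PS + PS.T))
cond_before = spectrum[-1] / spectrum[0]
cond_after = eig_deflated[-1] / eig_deflated[0]
print(f"condition number before: {cond_before:.1e}, after deflation: {cond_after:.3f}")
```

The three deflated eigenvalues are mapped to one, while the remaining eigenvalues are left unchanged, so the condition number seen by CG is determined by the cluster alone.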

#### *9.5.3 Domain Decomposition*

In the last section, the vertices $\mathcal{V} = \{1, 2, \ldots, n\}$ of $\mathcal{G}(A)$ were partitioned into non-overlapping subsets. Alternatively, overlapping subsets (which are generally termed subdomains because the approach was originally proposed for problems that had an underlying grid) may be used. Domain decomposition methods based on overlapping subdomains are often referred to as **Schwarz methods**. Given $N > 1$, let $\Delta\_i$ be the subset of size $n\_{\Delta i}$ of vertices that are distance one in $\mathcal{G}(A)$ from the vertices in $\Omega\_{Ii}$ ($1 \le i \le N$). The overlapping subdomain $\Omega\_i$ is defined to be $\Omega\_i = [\Omega\_{Ii}, \Delta\_i]$, with size $n\_i = n\_{\Delta i} + n\_{Ii}$.

Associate with $\Omega\_i$ an $n\_i \times n$ restriction (or projection) matrix given by $R\_i = I\_n(\Omega\_i, :)$. $R\_i$ maps from the global domain to subdomain $\Omega\_i$; its transpose $R\_i^T$ is a prolongation matrix that maps from subdomain $\Omega\_i$ to the global domain. The **one-level additive Schwarz preconditioner** is defined to be

$$M\_{AS}^{-1} = \sum\_{i=1}^{N} R\_i^T A\_i^{-1} R\_i, \qquad A\_i = R\_i A R\_i^T. \tag{9.20}$$

Applying this preconditioner to a vector involves solving concurrent local problems in the overlapping subdomains. Increasing $N$ reduces the sizes $n\_i$ of the overlapping subdomains, leading to smaller local problems and faster computations. However, the preconditioned system using $M\_{AS}^{-1}$ may not be well conditioned and the convergence of the iterative solver may be inhibited. In fact, the local nature of this preconditioner can lead to a deterioration in its effectiveness as the number of subdomains increases because of the lack of global information from the matrix $A$. To maintain robustness with respect to $N$, an artificial subdomain that includes global information is added to the preconditioner (this is also known as a second-level or coarse space correction). Let $0 < n\_0 \ll n$. If the $n\_0 \times n$ matrix $R\_0$ is of full row rank, the **two-level additive Schwarz preconditioner** is defined to be

$$M\_{AS2}^{-1} = M\_{AS}^{-1} + R\_0^T A\_0^{-1} R\_0, \qquad A\_0 = R\_0 A R\_0^T.$$

The coarse space correction can also be applied in a multiplicative way, which can lead to more robust variants. A sparse direct method can be used for the solves with each $A\_i$, which has the advantage of being robust and is another example of a hybrid approach. Alternatively, for very large systems, incomplete factorization preconditioners or approximate inverse preconditioners and an iterative method can be used. While this may result in a slower convergence rate, it can lead to a faster method overall because each iteration is less expensive (and may be the only option if the direct solver requires too much memory). Generalizing the approach to a hierarchy of additions of artificial domains leads to the class of **multilevel** methods. Again, employing them as preconditioners requires solves with the domain matrices, which can be based on sparse direct methods or preconditioned iterative methods.
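The one-level and two-level additive Schwarz preconditioners can be assembled explicitly for a small model problem. The following dense sketch is illustrative only: the test matrix (a one-dimensional Laplacian), the contiguous splitting of the vertices, and the piecewise-constant choice of $R\_0$ (one row per non-overlapping block) are all assumptions made for the example, and the function names are hypothetical. Local solves use a dense direct solver.

```python
import numpy as np

def laplacian_1d(n):
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def overlapping_subdomains(A, N):
    """Split 0..n-1 into N contiguous blocks and add their distance-one neighbours in G(A)."""
    n = A.shape[0]
    subdomains = []
    for interior in np.array_split(np.arange(n), N):
        neighbours = np.nonzero(np.abs(A[interior, :]).sum(axis=0))[0]
        subdomains.append(np.union1d(interior, neighbours))
    return subdomains

def additive_schwarz(A, subdomains):
    """Form M_AS^{-1} = sum_i R_i^T A_i^{-1} R_i as a dense matrix."""
    n = A.shape[0]
    M_inv = np.zeros((n, n))
    for idx in subdomains:
        R = np.eye(n)[idx, :]                      # R_i = I_n(Omega_i, :)
        A_i = R @ A @ R.T
        M_inv += R.T @ np.linalg.solve(A_i, R)
    return M_inv

def coarse_correction(A, N):
    """R_0^T A_0^{-1} R_0 with a piecewise-constant R_0 (an assumed, simple coarse space)."""
    n = A.shape[0]
    R0 = np.zeros((N, n))
    for row, interior in enumerate(np.array_split(np.arange(n), N)):
        R0[row, interior] = 1.0
    A0 = R0 @ A @ R0.T
    return R0.T @ np.linalg.solve(A0, R0)

def preconditioned_spectrum(M_inv, A):
    """Eigenvalues of M^{-1} A, computed from the similar symmetric matrix C^T M^{-1} C (A = C C^T)."""
    C = np.linalg.cholesky(A)
    return np.linalg.eigvalsh(C.T @ M_inv @ C)

n, N = 64, 8
A = laplacian_1d(n)
M1 = additive_schwarz(A, overlapping_subdomains(A, N))
M2 = M1 + coarse_correction(A, N)
ev1 = preconditioned_spectrum(M1, A)
ev2 = preconditioned_spectrum(M2, A)
print("one-level: lambda_min = %.3e, cond = %.1f" % (ev1[0], ev1[-1] / ev1[0]))
print("two-level: lambda_min = %.3e, cond = %.1f" % (ev2[0], ev2[-1] / ev2[0]))
```

Because the coarse term is positive semidefinite, the smallest eigenvalue of the two-level preconditioned matrix cannot be smaller than that of the one-level one; how much it improves depends on the choice of $R\_0$.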

An attractive feature of domain decomposition methods is that they are naturally parallel because all subdomain computations can be performed simultaneously. The restricted additive Schwarz preconditioner is obtained by a simple and efficient change that removes the overlap in the prolongation, replacing (9.20) by

$$M\_{RAS}^{-1} = \sum\_{i=1}^{N} \widehat{R}\_i^T A\_i^{-1} R\_i,$$

where $\widehat{R}\_i = I\_n(\Omega\_{Ii}, :)$. The main motivation here is to reduce the communication cost by half because computing products such as $\widehat{R}\_i w$ does not involve any data exchange with neighbouring processors.

#### **9.6 Notes and References**

A useful textbook on iterative methods is Saad (2003b). It includes the result stated in Theorem 9.3, while the proof of Theorem 9.2 is given in Mendelsohn (1956). Other key books include Meurant (1999), van der Vorst (2003), and the recent monograph of Bertaccini & Durastante (2018), as well as Liesen & Strakoš (2013) and Meurant & Duintjer Tebbens (2020), which targets theoretical and practical properties of iterative methods. The excellent surveys of Benzi (2002), Wathen (2015), and Pearson & Pestana (2020) present overviews of preconditioning techniques, and the monograph of Chen (2005) describes several approaches and includes many example applications, while Bollhöfer (2015) gives a practically oriented survey that mainly targets multilevel and parallel aspects of algebraic preconditioners. A discussion of the desirable properties of preconditioners can be found in Chow & Saad (1997). More sophisticated dropping strategies and the relation between ILU factorizations and factorized approximate inverses are considered by Bollhöfer & Saad (2002, 2006), while Kopal et al. (2016) discuss adaptive dropping.

For a basic introduction to the stability problems of LU-based preconditioners, see Elman (1986, 1989). The Eisenstat trick of Section 9.2.3 is presented by Eisenstat (1981). An interesting discussion putting this into the context of other similar ideas is given in Ortega (1988a).

The issue of potential breakdown during incomplete factorizations was pointed out by Kershaw (1978). This strengthened interest in classes of matrices for which breakdown cannot occur. Theorem 9.4 for M-matrices is from Meijerink & van der Vorst (1977); the extension to H-matrices is given independently by Manteuffel (1980) and Varga et al. (1980). Favourable asymptotic bounds for the condition number of M-matrices preconditioned by modified incomplete factorizations were an important impetus behind the development of algebraic preconditioners. These are described in Axelsson (1972) and Gustafsson (1978, 1979), but see also the early sophisticated analysis of relaxation methods presented in Dupont et al. (1968). Some of the assumptions that were used to obtain early asymptotic bounds were later shown to be unnecessary (Bern et al., 2006). Practical choices of polynomial preconditioners, particularly for SPD systems, are discussed in the book by Saad (2003b) (and the earlier introductory paper of Saad, 1985). Note the recent interest of Loe & Morgan (2021) and Ye et al. (2021), the former motivated by the potential to reduce communication in parallel computing.

For preconditioning saddle point problems using algebraic approaches, the highly cited survey of Benzi et al. (2005) and monograph of Rozložník (2018) are good starting points. We also refer to the papers by Maryška et al. (1996, 2000a,b) and Arioli et al. (2006) on the iterative solution of algebraically preconditioned saddle point problems from PDE applications.

There are a number of monographs on domain decomposition methods. An important algorithmically oriented introduction is Smith et al. (1996), but see also Quarteroni & Valli (1999) and Toselli & Widlund (2005) as well as the books by Olshanskii & Tyrtyshnikov (2014) and Dolean et al. (2015), which emphasize connections to PDEs and solution techniques motivated by them. We recommend the paper of Tang et al. (2009) for an algebraic comparison of different classes of domain decomposition and deflation preconditioners. A further line of research resulting in general algebraic preconditioners has been developed using hierarchical matrices; the papers include Bebendorf & Fischer (2008) and Bebendorf et al. (2013) and the monograph on hierarchical matrices of Bebendorf (2008). The ShyLU software package developed by Rajamanickam et al. (2012) is a fully algebraic hybrid package for solving sparse linear systems using domain decomposition methods. It offers distributed memory domain decomposition solvers and node level solvers and kernels that support the distributed memory solvers. The node level solvers include sparse LU and Cholesky factorizations, a multithreaded triangular solver, and a fast iterative ILU algorithm. ShyLU is available as part of Trilinos (ShyLU Project Team, 2022).

Algebraic multigrid (AMG) methods are another important class of frequently used methods. AMG methods can be used to precondition a wide spectrum of problems, but their development has been mainly motivated by systems arising from the discretization of PDEs, often exploiting specific properties of discretized models. A recommended overview is by Xu & Zikatanov (2017); see also Stüben et al. (2017).

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 10 Incomplete Factorizations**

*They [incomplete factorizations] can be thought of as approximating the exact LU factorization of a given matrix A (e.g. computed via Gaussian elimination) by disallowing certain fill-ins. As opposed to other PDE-based preconditioners such as multigrid and domain decomposition, this class of preconditioners are primarily algebraic in nature and can in principle be applied to any sparse matrices. When applied to PDE problems, they are usually not optimal ... On the other hand, they are often quite robust. – Chan & van der Vorst (1997).*



Having introduced incomplete factorization preconditioners in the previous chapter, the focus in this chapter is on different ways to compute such factorizations and their relationship to the complete factorizations used in sparse direct methods. We denote the incomplete factors by $\widetilde{L}$ and $\widetilde{U}$; in the SPD case, $\widetilde{U} = \widetilde{L}^T$. We assume that the sparsity patterns of $A$ and its incomplete factors always include the positions of the diagonal entries.

#### **10.1 ILU(0) Factorization**

The simplest sparsity pattern for an incomplete factorization is $\mathcal{S}\{\widetilde{L} + \widetilde{U}\} = \mathcal{S}\{A\}$, that is, no entries in $\widetilde{L}$ or $\widetilde{U}$ are allowed outside the sparsity pattern of $A$ and only entries in positions $(i,j) \in \mathcal{S}\{A\}$ are retained in the (incomplete) elimination matrices. The resulting incomplete factorization is called an ILU(0) factorization (or an IC(0) factorization if $A$ is SPD).

Motivation for considering a sparsity pattern that is a superset of $\mathcal{S}\{A\}$ is given by the following straightforward but important result.

**Theorem 10.1 (Chan & van der Vorst 1997; van der Vorst 2003)** *Consider the incomplete LU factorization* $A + E = \widetilde{L}\widetilde{U}$ *with sparsity pattern* $\mathcal{S}\{\widetilde{L} + \widetilde{U}\}$*. The entries of the error matrix* $E$ *are zero at positions* $(i,j) \in \mathcal{S}\{\widetilde{L} + \widetilde{U}\}$*.*





*Proof* The result clearly holds for $j = 1$. Let $(i,j) \in \mathcal{S}\{\widetilde{L} + \widetilde{U}\}$ and assume without loss of generality that $i > j > 1$. The $(i,j)$ entry of $\widetilde{L}$ is computed as

$$\widetilde{l}\_{ij} = \left( a\_{ij} - \sum\_{k=1}^{j-1} \widetilde{l}\_{ik} \, \widetilde{u}\_{kj} \right) / \widetilde{u}\_{jj}$$

with the sums over $k$ implying $(i,k) \in \mathcal{S}\{\widetilde{L} + \widetilde{U}\}$ and $(k,j) \in \mathcal{S}\{\widetilde{L} + \widetilde{U}\}$. This gives

$$a\_{ij} = \widetilde{L}\_{i,1:j-1} \widetilde{U}\_{1:j-1,j} + \widetilde{l}\_{ij} \widetilde{u}\_{jj} = \widetilde{L}\_{i,1:j} \widetilde{U}\_{1:j,j} = (\widetilde{L}\widetilde{U})\_{ij},$$

and the corresponding entry of $E$ is zero.
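Theorem 10.1 is easy to observe numerically. The following sketch is a dense-storage illustration, not an efficient sparse implementation: it computes an incomplete LU factorization restricted to a prescribed boolean pattern using a row-oriented elimination, and then inspects $E = \widetilde{L}\widetilde{U} - A$ for ILU(0) applied to a small two-dimensional Laplacian. The function name `ilu_pattern` and the test problem are assumptions made for the example.

```python
import numpy as np

def ilu_pattern(A, pattern):
    """Row-oriented incomplete LU restricted to a boolean sparsity pattern (dense storage)."""
    n = A.shape[0]
    W = np.where(pattern, A, 0.0).astype(float)
    for i in range(1, n):
        for k in range(i):
            if not pattern[i, k]:
                continue
            W[i, k] /= W[k, k]                    # multiplier l_ik (a zero pivot means breakdown)
            for j in range(k + 1, n):
                if pattern[i, j]:                 # updates outside the pattern are discarded
                    W[i, j] -= W[i, k] * W[k, j]
    L = np.tril(W, -1) + np.eye(n)
    U = np.triu(W)
    return L, U

# 5-point Laplacian on a 4 x 4 grid; ILU(0) uses the pattern of A itself.
m = 4
T = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
A = np.kron(np.eye(m), T) + np.kron(T, np.eye(m))
pattern = A != 0

L, U = ilu_pattern(A, pattern)
E = L @ U - A
error_inside = np.abs(E[pattern]).max()
error_outside = np.abs(E[~pattern]).max()
print(f"max |E| inside the pattern:  {error_inside:.2e}")
print(f"max |E| outside the pattern: {error_outside:.2e}")
```

The error vanishes (to rounding) inside the pattern and is nonzero only at positions where fill-in was discarded.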


A consequence of Theorem 10.1 is that extending $\mathcal{S}\{\widetilde{L} + \widetilde{U}\}$ gives a larger set of entries of $A$ for which the error is zero. This is attractive provided the incomplete factorization can still be computed and employed cheaply and does not require prohibitive amounts of memory. In some situations, there are straightforward ways to extend $\mathcal{S}\{\widetilde{L} + \widetilde{U}\}$. For example, consider a simple discretization of a PDE on a rectangular grid. The sparsity pattern of the corresponding SPD matrix $A$ and its graph $\mathcal{G}(A)$ together with the first three steps of the Cholesky factorization of $A$ (in which variables 1, 2, and 3 are eliminated in turn) are given in Figure 10.1. $A$ has entries on the diagonal and four of its subdiagonals and the fill-in lies within $band(A)$. A natural choice is to allow $\mathcal{S}\{\widetilde{L} + \widetilde{U}\}$ to include fill-in along a few additional diagonals within the band.

**Figure 10.1** An 8 × 8 banded sparse SPD matrix *A* and its graph G*(A)*. The first three steps of a Cholesky factorization are shown. Filled entries are denoted by *f* .

### **10.2 Basic Incomplete Factorizations**

We start with the two basic incomplete factorizations. Here and elsewhere, section notation is used but operations are performed only on nonzero entries. The Crout variant given in Algorithm 10.1 computes $\widetilde{U}$ row-by-row and $\widetilde{L}$ column-by-column and sparsifies each row and column as soon as they are computed using a target sparsity pattern $\mathcal{S}\{\widetilde{L} + \widetilde{U}\}$. The widely used variant outlined in Algorithm 10.2 constructs both $\widetilde{L}$ and $\widetilde{U}$ by rows. Prescribing an appropriate sparsity pattern in advance can be difficult. If it is not supplied, sparsification can be applied inside the $k$ loops (for instance, entries with absolute value less than a chosen tolerance may be dropped) and the sparsity patterns of the factors updated as the factorization proceeds.

Algorithms 10.1 and 10.2 are straightforward to implement using sparse data structures. At major step $i$, Algorithm 10.2 computes $\widetilde{L}\_{i,1:i-1}$ and $\widetilde{U}\_{i,i+1:n}$; both rows can be held using a single auxiliary vector. Note that, in Algorithm 10.1, sparsification of the partially computed vectors is performed outside the $k$ loops, whereas in Algorithm 10.2 it is inside the $k$ loop. In practice, either approach can be used, leading to slightly different variants.



#### **ALGORITHM 10.1 Crout incomplete LU factorization**

**Input:** Matrix $A$ and, optionally, a target sparsity pattern $\mathcal{S}\{\widetilde{L} + \widetilde{U}\}$.

**Output:** Incomplete LU factorization $A \approx \widetilde{L}\widetilde{U}$.

1. **for** $j = 1 : n$ **do**
2. $\widetilde{l}\_{jj} = 1$, $\widetilde{L}\_{j+1:n,j} = A\_{j+1:n,j}$
3. $\widetilde{U}\_{j,j:n} = A\_{j,j:n}$
4. **for** $k = 1 : j-1$ such that $(j,k) \in \mathcal{S}\{\widetilde{L}\}$ **do**
5. $\widetilde{U}\_{j,j:n} = \widetilde{U}\_{j,j:n} - \widetilde{l}\_{jk} \, \widetilde{U}\_{k,j:n}$ (sparse linear combination)
6. **end for**
7. Sparsify $\widetilde{U}\_{j,j+1:n}$ (drop entries from row $j$ of $\widetilde{U}$)
8. **for** $k = 1 : j-1$ such that $(k,j) \in \mathcal{S}\{\widetilde{U}\}$ **do**
9. $\widetilde{L}\_{j+1:n,j} = \widetilde{L}\_{j+1:n,j} - \widetilde{u}\_{kj} \, \widetilde{L}\_{j+1:n,k}$ (sparse linear combination)
10. **end for**
11. Sparsify $\widetilde{L}\_{j+1:n,j}$ (drop entries from column $j$ of $\widetilde{L}$)
12. $\widetilde{L}\_{j+1:n,j} = \widetilde{L}\_{j+1:n,j} / \widetilde{u}\_{jj}$
13. **end for**

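#### **ALGORITHM 10.2 Row incomplete LU factorization**

**Input:** Matrix $A$ and, optionally, a target sparsity pattern $\mathcal{S}\{\widetilde{L} + \widetilde{U}\}$.

**Output:** Incomplete LU factorization $A \approx \widetilde{L}\widetilde{U}$.

1. **for** $i = 1 : n$ **do**
2. $\widetilde{l}\_{ii} = 1$, $\widetilde{L}\_{i,1:i-1} = A\_{i,1:i-1}$
3. $\widetilde{U}\_{i,i:n} = A\_{i,i:n}$
4. Sparsify $\widetilde{L}\_{i,1:i-1}$ and $\widetilde{U}\_{i,i+1:n}$
5. **for** $k = 1 : i-1$ such that $(i,k) \in \mathcal{S}\{\widetilde{L}\}$ **do**
6. $\widetilde{l}\_{ik} = \widetilde{l}\_{ik} / \widetilde{u}\_{kk}$
7. $\widetilde{L}\_{i,k+1:i-1} = \widetilde{L}\_{i,k+1:i-1} - \widetilde{l}\_{ik} \, \widetilde{U}\_{k,k+1:i-1}$
8. Sparsify $\widetilde{L}\_{i,k+1:i-1}$
9. $\widetilde{U}\_{i,i:n} = \widetilde{U}\_{i,i:n} - \widetilde{l}\_{ik} \, \widetilde{U}\_{k,i:n}$
10. Sparsify $\widetilde{U}\_{i,i+1:n}$
11. **end for**
12. **end for**
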
### **10.3 Incomplete Factorizations Based on the Shortest Fill-Paths**

We next consider an incomplete LU factorization that uses a structure-based dropping strategy. Entries of the factors that correspond to nonzero entries of *A* are assigned the level 0, while each potential filled entry in position *(i, j )* is assigned a level as follows:

$$level(i,j) = \min\_{1 \le k < \min\{i, j\}} (level(i,k) + level(k,j) + 1). \tag{10.1}$$

Given $\ell \ge 0$, during the factorization, a filled entry is permitted at position $(i,j)$ provided $level(i,j) \le \ell$. The resulting **level-based** incomplete factorization is denoted by ILU($\ell$) (or IC($\ell$)); the basic row variant is given in Algorithm 10.3.

Figure 10.2 depicts $\mathcal{S}\{\widetilde{L} + \widetilde{L}^T\}$ for the IC($\ell$) factorization of $A$ from the discretized Laplace equation on a square grid (see the smaller problem in (9.14)) and for a matrix with a more general symmetric sparsity structure. The fill-in is typically generated irregularly throughout the factorization: initially few updates are needed, but later steps involve many updates, leading to large amounts of dropping. Furthermore, the amount of fill-in can grow quickly with increasing $\ell$ and, as a result, $\ell$ is typically small and level-based dropping is often combined with threshold-based dropping or with sparsifying $A$ before the factorization commences (for example, by discarding entries of $A$ with small absolute values).



#### **ALGORITHM 10.3 Level-based incomplete LU factorization**

**Input:** Matrix $A$ and the level parameter $\ell \ge 0$.

**Output:** ILU($\ell$) factorization $A \approx \widetilde{L}\widetilde{U}$.

1. Initialise $level$ to 0 for nonzeros and diagonal entries of $A$ and to $n+1$ otherwise
2. **for** $i = 1 : n$ **do** (loop over rows)
3. $\widetilde{l}\_{ii} = 1$, $\widetilde{L}\_{i,1:i-1} = A\_{i,1:i-1}$ and $\widetilde{U}\_{i,i:n} = A\_{i,i:n}$ (initialise row $i$ of $\widetilde{L}$ and $\widetilde{U}$)
4. **for** $k = 1 : i-1$ such that $level(i,k) \le \ell$ **do**
5. $\widetilde{l}\_{ik} = \widetilde{l}\_{ik} / \widetilde{u}\_{kk}$
6. **for** $j = k+1 : i-1$ **do**
7. $\widetilde{l}\_{ij} = \widetilde{l}\_{ij} - \widetilde{l}\_{ik} \, \widetilde{u}\_{kj}$ and update $level(i,j)$
8. **end for**
9. **for** $j = i : n$ **do**
10. $\widetilde{u}\_{ij} = \widetilde{u}\_{ij} - \widetilde{l}\_{ik} \, \widetilde{u}\_{kj}$ and update $level(i,j)$
11. **end for**
12. **end for**
13. **for** $k = 1 : i-1$ **do** (drop entries in row $i$ for which $level$ is too high)
14. **if** $level(i,k) > \ell$ **then** $\widetilde{l}\_{ik} = 0$
15. **end for**
16. **for** $k = i : n$ **do**
17. **if** $level(i,k) > \ell$ **then** $\widetilde{u}\_{ik} = 0$
18. **end for**
19. **end for**

The level-based strategy comes from observing that in practical examples the absolute values of the entries in the factors in positions for which *level* is large are often small. This is the case for model problems arising from discretized PDEs. A closer look shows a surprising connection between the level-based ILU factorization and the complete factorization: entries with large values of *level* correspond to long fill-paths. This is expressed in Theorem 10.2, which allows the sparsity patterns of the incomplete factors to be determined a priori.

**Theorem 10.2 (Hysom & Pothen 2002)** *Consider the ILU(*$\ell$*) factorization of* $A$*.* $level(i,j) = k$ *for some* $k \le \ell$ *if and only if there is a shortest fill-path* $i \Rightarrow j$ *of length* $k+1$ *in the adjacency graph* $\mathcal{G}(A)$*.*
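The equivalence in Theorem 10.2 can be checked on a small example by computing the levels in two ways: symbolically from the recurrence (10.1), and by a breadth-first search in $\mathcal{G}(A)$ that only passes through vertices numbered lower than $i$. The sketch below is illustrative: it uses dense boolean patterns and 0-based indices, the function names are hypothetical, and the test pattern (a 5-point stencil on a $4 \times 4$ grid) is an assumption made for the example.

```python
from collections import deque
import numpy as np

def levels_by_elimination(pattern):
    """Apply (10.1) symbolically, pivot by pivot; entries never created keep a level above n."""
    n = pattern.shape[0]
    INF = n + 1
    level = np.where(pattern | np.eye(n, dtype=bool), 0, INF)
    for k in range(n):
        for i in range(k + 1, n):
            if level[i, k] >= INF:
                continue
            for j in range(k + 1, n):
                level[i, j] = min(level[i, j], level[i, k] + level[k, j] + 1)
    return level

def row_pattern_by_bfs(pattern, i, ell):
    """Columns j > i with level(i, j) <= ell, found by breadth-first search from vertex i."""
    length = {i: 0}
    visited = {i}
    cols = set()
    queue = deque([i])
    while queue:
        k = queue.popleft()
        for j in np.nonzero(pattern[k])[0]:
            j = int(j)
            if j in visited:
                continue
            visited.add(j)
            if j < i and length[k] < ell:        # interior vertices of a fill-path are below i
                queue.append(j)
                length[j] = length[k] + 1
            elif j > i:                          # end of a fill-path with at most ell + 1 edges
                cols.add(j)
    return cols

# Pattern of the 5-point Laplacian on a 4 x 4 grid.
m = 4
T = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
pattern = (np.kron(np.eye(m), T) + np.kron(T, np.eye(m))) != 0
n = pattern.shape[0]

level = levels_by_elimination(pattern)
agree = all(
    row_pattern_by_bfs(pattern, i, ell) == {j for j in range(i + 1, n) if level[i, j] <= ell}
    for ell in (0, 1, 2)
    for i in range(n)
)
print("BFS patterns match level-based patterns:", agree)
```

The search for each row uses only the graph of $A$, so the rows can be processed independently.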

Algorithm 10.4 outlines finding the pattern of row $i$ of $\widetilde{U}$; finding the pattern of columns of $\widetilde{L}$ is analogous. Only $\mathcal{G}(A)$ is required, and hence the sparsity pattern of each row in the factor can be computed independently, in parallel. The algorithm operates via a simple breadth-first search that finds a shortest path between vertex $i$ and vertices reachable from $i$ via a graph traversal of $\ell + 1$ or fewer edges. The correctness of the algorithm follows from Theorem 10.2.

#### **ALGORITHM 10.4 Find the sparsity pattern of row $i$ of the ILU($\ell$) factor $\widetilde{U}$ of $A$**

**Input:** Graph $\mathcal{G}(A)$, the level parameter $\ell \ge 0$ and row index $i$.

**Output:** Sparsity pattern $\mathcal{S}\{\widetilde{U}\_{i,i:n}\}$ of row $i$ of the ILU($\ell$) factorization $A \approx \widetilde{L}\widetilde{U}$.

1. $\mathcal{S}\{\widetilde{U}\_{i,i:n}\} = \{i\}$, $\mathcal{Q} = \{i\}$ (queue holds $i$ initially)
2. $length(i) = 0$
3. $visited(i) = i$
4. **while** $\mathcal{Q}$ is not empty **do**
5. $pop(\mathcal{Q}, k)$ (take $k$ from the queue)
6. **for** $j \in adj\_{\mathcal{G}(A)}(k)$ with $visited(j) \ne i$ **do**
7. $visited(j) = i$
8. **if** $j < i$ and $length(k) < \ell$ **then**
9. $append(\mathcal{Q}, j)$ (add $j$ to the queue)
10. $length(j) = length(k) + 1$
11. **else if** $j > i$ **then**
12. $\mathcal{S}\{\widetilde{U}\_{i,i:n}\} = \mathcal{S}\{\widetilde{U}\_{i,i:n}\} \cup \{j\}$ (add $j$ to the sparsity pattern of row $i$)
13. **end if**
14. **end for**
15. **end while**

**Figure 10.2** The sparsity patterns of the IC($\ell$) factors of $A$ from the discretized Laplace equation on a square grid (top) and a more general symmetric sparse matrix (bottom).

#### **10.4 Modifications Based on Maintaining Row Sums**

We assume in this section that the target sparsity pattern $\mathcal{S}\{\widetilde{L} + \widetilde{U}\}$ contains $\mathcal{S}\{A\}$. **Modified incomplete factorizations** (MILU or MIC in the SPD case) seek to maintain equality between the row sums of $A$ and $\widetilde{L}\widetilde{U}$, that is, $\widetilde{L}\widetilde{U}e = Ae$ ($e$ is the vector of all ones). Rather than discarding potential fill-in outside the target sparsity pattern, the approach subtracts it from the diagonal entries of $\widetilde{U}$; this is outlined in Algorithm 10.5. Note that an MILU factorization may break down. If the target sparsity pattern corresponds to that of an ILU($\ell$) factorization, then an MILU($\ell$) factorization is computed.

Equality of the row sums of $A$ and $\widetilde{L}\widetilde{U}$ can be seen as follows. If all the filled entries are retained (that is, $\mathcal{S}\{\widetilde{L} + \widetilde{U}\} = \mathcal{S}\{L + U\}$), then the claim holds trivially. Now assume some filled entries are not kept. If an entry in column $j$ of row $i$ of $A$ belongs to the target sparsity pattern, then its value is modified in Step 8 if $i \le j$ or in Step 15 if $i > j$. Otherwise, the $i$-th diagonal entry of $\widetilde{U}$ is modified (Step 10 or Step 17). In each case, $\widetilde{l}\_{ik} \widetilde{u}\_{kj}$ is subtracted from entries of the $i$-th row of the incomplete factors. Consider the sum of the entries in row $i$ of $\widetilde{L}\widetilde{U}$, that is, entry $i$ of $\widetilde{L}\widetilde{U}e$. Because $\widetilde{l}\_{ii} = 1$, it is given by

$$\begin{split} \sum\_{j=1}^{i} \tilde{l}\_{ij} \sum\_{k=j}^{n} \tilde{u}\_{jk} &= \sum\_{j=1}^{i-1} \tilde{l}\_{ij} \tilde{u}\_{jj} + \sum\_{j=1}^{i-1} \tilde{l}\_{ij} \sum\_{k=j+1}^{n} \tilde{u}\_{jk} + \sum\_{k=i}^{n} \tilde{u}\_{ik} \\ &= \sum\_{j=1}^{i-1} \left( a\_{ij} - \sum\_{k=1}^{j-1} \tilde{l}\_{ik} \tilde{u}\_{kj} \right) + \sum\_{j=1}^{i-1} \tilde{l}\_{ij} \sum\_{k=j+1}^{n} \tilde{u}\_{jk} + \sum\_{k=i}^{n} \left( a\_{ik} - \sum\_{j=1}^{i-1} \tilde{l}\_{ij} \tilde{u}\_{jk} \right) \\ &= \sum\_{j=1}^{n} a\_{ij} + \sum\_{j=1}^{i-1} \tilde{l}\_{ij} \sum\_{k=j+1}^{n} \tilde{u}\_{jk} - \left( \sum\_{j=1}^{i-1} \sum\_{k=1}^{j-1} \tilde{l}\_{ik} \tilde{u}\_{kj} + \sum\_{k=i}^{n} \sum\_{j=1}^{i-1} \tilde{l}\_{ij} \tilde{u}\_{jk} \right). \end{split}$$

Rearranging the indices in the double summations, the last three sums cancel out. Moreover, the added double summation is the sum of all the modification terms $\widetilde{l}\_{ik} \widetilde{u}\_{kj}$ in Algorithm 10.5, and the sum of the two subtracted double summations also comprises all the modification terms. Consequently, the row sums of $A$ are preserved in the product of the incomplete factors.

#### **ALGORITHM 10.5 Modified incomplete factorization (MILU)**

**Input:** Matrix $A = L\_A + D\_A + U\_A$ (see (9.6)) and a target sparsity pattern $\mathcal{S}\{\widetilde{L} + \widetilde{U}\}$ containing $\mathcal{S}\{A\}$.

**Output:** Incomplete LU factorization $A \approx \widetilde{L}\widetilde{U}$.

1. $\widetilde{l}\_{ij} = (I + L\_A)\_{ij}$ for all $(i,j) \in \mathcal{S}\{\widetilde{L}\}$ ($\mathcal{S}\{L\_A\} \subseteq \mathcal{S}\{\widetilde{L}\}$)
2. $\widetilde{u}\_{ij} = (D\_A + U\_A)\_{ij}$ for all $(i,j) \in \mathcal{S}\{\widetilde{U}\}$ ($\mathcal{S}\{U\_A\} \subseteq \mathcal{S}\{\widetilde{U}\}$)
3. **for** $k = 1 : n-1$ **do**
4. **for** $i = k+1 : n$ such that $(i,k) \in \mathcal{S}\{\widetilde{L}\}$ **do**
5. $\widetilde{l}\_{ik} = \widetilde{l}\_{ik} / \widetilde{u}\_{kk}$ (check that $\widetilde{u}\_{kk}$ is nonzero)
6. **for** $j = i : n$ such that $(k,j) \in \mathcal{S}\{\widetilde{U}\}$ **do**
7. **if** $(i,j) \in \mathcal{S}\{\widetilde{U}\}$ **then**
8. $\widetilde{u}\_{ij} = \widetilde{u}\_{ij} - \widetilde{l}\_{ik} \, \widetilde{u}\_{kj}$
9. **else**
10. $\widetilde{u}\_{ii} = \widetilde{u}\_{ii} - \widetilde{l}\_{ik} \, \widetilde{u}\_{kj}$ (modify diagonal instead of creating fill-in)
11. **end if**
12. **end for**
13. **for** $j = k+1 : i-1$ such that $(k,j) \in \mathcal{S}\{\widetilde{U}\}$ **do**
14. **if** $(i,j) \in \mathcal{S}\{\widetilde{L}\}$ **then**
15. $\widetilde{l}\_{ij} = \widetilde{l}\_{ij} - \widetilde{l}\_{ik} \, \widetilde{u}\_{kj}$
16. **else**
17. $\widetilde{u}\_{ii} = \widetilde{u}\_{ii} - \widetilde{l}\_{ik} \, \widetilde{u}\_{kj}$ (modify diagonal instead of creating fill-in)
18. **end if**
19. **end for**
20. **end for**
21. **end for**

Theorem 10.3 provides motivation for maintaining constant row sums in the case of a model PDE problem. The result is also valid for Neumann or mixed boundary conditions, and there are extensions to three-dimensional problems and MIC($\ell$) with $\ell > 0$. However, although Theorem 10.1 holds for MILU factorizations, the approach may not be useful for general $A$.

**Theorem 10.3 (Gustafsson 1978; Bern et al. 2006)** *Let* $A$ *come from a discretized Poisson problem on a uniform two-dimensional rectangular grid with Dirichlet boundary conditions and discretization parameter* $h$*. Then the condition number* $\kappa((\widetilde{L}\widetilde{U})^{-1}A)$ *for the level-based MIC(0) preconditioner is* $O(h^{-1})$*.*

Optionally, in Steps 10 and 17 of Algorithm 10.5, the update term $\widetilde{l}\_{ik} \widetilde{u}\_{kj}$ may be multiplied by a parameter $\theta$ ($0 < \theta < 1$) before it is subtracted from the diagonal entry $\widetilde{u}\_{ii}$. The introduction of $\theta$ was proposed as a practical way to extend MILU to linear systems not coming from discretized PDEs. Clearly, using $\theta < 1$ reduces the amount by which the diagonal entries are modified.
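The row-sum property is straightforward to observe in a small experiment. The following dense-storage sketch follows the structure of Algorithm 10.5 with the optional parameter $\theta$; it is meant as an illustration under stated assumptions rather than a practical sparse implementation. The function name `milu`, the shifted Laplacian test matrix (chosen to be a strictly diagonally dominant M-matrix so that no zero pivot is met), and the use of $\theta = 0$ to recover the unmodified factorization for comparison are choices made for the example.

```python
import numpy as np

def milu(A, pattern, theta=1.0):
    """Dense sketch of a modified incomplete LU: discarded fill is lumped onto the diagonal of U."""
    n = A.shape[0]
    W = np.where(pattern, A, 0.0).astype(float)   # strict lower part holds L, upper part holds U
    for k in range(n - 1):
        for i in range(k + 1, n):
            if not pattern[i, k]:
                continue
            if W[k, k] == 0.0:
                raise ZeroDivisionError("zero pivot: the factorization has broken down")
            W[i, k] /= W[k, k]
            for j in range(k + 1, n):
                if not pattern[k, j]:
                    continue
                update = W[i, k] * W[k, j]
                if pattern[i, j]:
                    W[i, j] -= update
                else:
                    W[i, i] -= theta * update     # modify the diagonal instead of creating fill-in
    return np.tril(W, -1) + np.eye(n), np.triu(W)

# Shifted 5-point Laplacian on a 4 x 4 grid (strictly diagonally dominant M-matrix).
m = 4
T = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
A = np.kron(np.eye(m), T) + np.kron(T, np.eye(m)) + 0.1 * np.eye(m * m)
pattern = A != 0
e = np.ones(m * m)

L, U = milu(A, pattern)                 # theta = 1: row sums are maintained
L0, U0 = milu(A, pattern, theta=0.0)    # theta = 0: plain incomplete LU on the same pattern
milu_gap = np.abs(L @ U @ e - A @ e).max()
ilu_gap = np.abs(L0 @ U0 @ e - A @ e).max()
print(f"row-sum mismatch with modification:    {milu_gap:.2e}")
print(f"row-sum mismatch without modification: {ilu_gap:.2e}")
```

Intermediate values of `theta` interpolate between the two behaviours, at the cost of losing exact equality of the row sums.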

### **10.5 Dynamic Compensation**

As discussed in Section 9.4.1, dropping entries can lead to breakdown. One way to avoid this (in exact arithmetic) is to dynamically modify the computed entries; this is outlined as Algorithm 10.6. Instead of accepting a filled entry in position $(i,j)$, the idea is to add a (weighted) multiple of its absolute value to the corresponding diagonal entries $\widetilde{u}\_{ii}$ and $\widetilde{u}\_{jj}$. Provided the number of modifications is small, this can be useful if $A$ is a diagonally dominant matrix and scaled so that its diagonal entries are nonnegative. The parameter $\omega$ controls the amount by which the diagonal entries of $\widetilde{U}$ are modified, but if $\omega < 1$, then breakdown can still occur. Dynamic compensation can be successful when incorporated into an IC factorization of an SPD matrix $A$ because the resulting local modifications correspond to adding positive semidefinite matrices to $A$. In practice, the behaviour of the resulting preconditioner can be very different from that computed using the MIC approach of the previous section.

#### **ALGORITHM 10.6 ILU factorization with dynamic compensation**

**Input:** Matrix $A = L\_A + D\_A + U\_A$ (see (9.6)), a target sparsity pattern $\mathcal{S}\{\widetilde{L} + \widetilde{U}\}$ and parameter $\omega$ ($0 \le \omega \le 1$).

**Output:** Incomplete LU factorization $A \approx \widetilde{L}\widetilde{U}$.

1. $\widetilde{l}\_{ij} = (I + L\_A)\_{ij}$ for all $(i,j) \in \mathcal{S}\{\widetilde{L}\}$
2. $\widetilde{u}\_{ij} = (D\_A + U\_A)\_{ij}$ for all $(i,j) \in \mathcal{S}\{\widetilde{U}\}$
3. **for** $k = 1 : n-1$ **do**
4. **for** $i = k+1 : n$ such that $(i,k) \in \mathcal{S}\{\widetilde{L}\}$ **do**
5. $\widetilde{l}\_{ik} = \widetilde{l}\_{ik} / \widetilde{u}\_{kk}$
6. **for** $j = i : n$ such that $(k,j) \in \mathcal{S}\{\widetilde{U}\}$ **do**
7. **if** $(i,j) \in \mathcal{S}\{\widetilde{U}\}$ **then**
8. $\widetilde{u}\_{ij} = \widetilde{u}\_{ij} - \widetilde{l}\_{ik} \, \widetilde{u}\_{kj}$
9. **else**
10. $\rho = (\widetilde{u}\_{ii} / \widetilde{u}\_{jj})^{1/2}$
11. $\widetilde{u}\_{ii} = \widetilde{u}\_{ii} + \omega \rho \, |\widetilde{l}\_{ik} \, \widetilde{u}\_{kj}|$, $\widetilde{u}\_{jj} = \widetilde{u}\_{jj} + \omega \, |\widetilde{l}\_{ik} \, \widetilde{u}\_{kj}| / \rho$, $\widetilde{u}\_{ij} = 0$
12. **end if**
13. **end for**
14. **for** $j = k+1 : i-1$ such that $(k,j) \in \mathcal{S}\{\widetilde{U}\}$ **do**
15. **if** $(i,j) \in \mathcal{S}\{\widetilde{L}\}$ **then**
16. $\widetilde{l}\_{ij} = \widetilde{l}\_{ij} - \widetilde{l}\_{ik} \, \widetilde{u}\_{kj}$
17. **else**
18. $\rho = (\widetilde{u}\_{ii} / \widetilde{u}\_{jj})^{1/2}$
19. $\widetilde{u}\_{ii} = \widetilde{u}\_{ii} + \omega \rho \, |\widetilde{l}\_{ik} \, \widetilde{u}\_{kj}|$, $\widetilde{u}\_{jj} = \widetilde{u}\_{jj} + \omega \, |\widetilde{l}\_{ik} \, \widetilde{u}\_{kj}| / \rho$, $\widetilde{l}\_{ij} = 0$
20. **end if**
21. **end for**
22. **end for**
23. **end for**

A related scheme, called **diagonally compensated reduction**, modifies *A* before the factorization begins by adding the values of all of its positive off-diagonal entries to the corresponding diagonal entries and then setting these off-diagonal entries to zero. If *A* is SPD, then the resulting matrix is a symmetric M-matrix and the incomplete factorization will not break down (Theorem 9.4). However, the modified matrix may be too far from *A* for its incomplete factors to be useful.
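Diagonally compensated reduction is simple enough to state in a few lines of code. The sketch below is a dense illustration with a hypothetical function name and a made-up $3 \times 3$ SPD matrix. For symmetric $A$, each positive pair $a\_{ij} = a\_{ji}$ that is moved to the diagonal adds the positive semidefinite matrix $a\_{ij}(e\_i - e\_j)(e\_i - e\_j)^T$, which the example confirms numerically.

```python
import numpy as np

def diagonally_compensated_reduction(A):
    """Add each positive off-diagonal entry to the diagonal entry of its row, then zero it."""
    B = np.array(A, dtype=float)
    positive = (B > 0) & ~np.eye(B.shape[0], dtype=bool)
    B[np.diag_indices_from(B)] += np.where(positive, B, 0.0).sum(axis=1)
    B[positive] = 0.0
    return B

# Small SPD example (symmetric, strictly diagonally dominant) with one positive off-diagonal pair.
A = np.array([[ 4.0,  1.0, -1.0],
              [ 1.0,  4.0, -2.0],
              [-1.0, -2.0,  5.0]])
B = diagonally_compensated_reduction(A)
off_diagonal = B - np.diag(np.diag(B))
min_eig_of_change = np.linalg.eigvalsh(B - A).min()
print(B)
print("smallest eigenvalue of B - A:", min_eig_of_change)
```

The modified matrix has no positive off-diagonal entries, but `B - A` can be large when $A$ has many positive off-diagonal entries, which is the drawback noted above.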

### **10.6 Memory-Limited Incomplete Factorizations**

We next consider a more sophisticated modification scheme that introduces the use of intermediate memory that is employed during the construction of the incomplete factors but is then discarded. The aim is to obtain a high quality preconditioner while maintaining sparsity and allowing the user to control how much memory is used (both in the construction of the preconditioner and in the incomplete factor $\widetilde{L}$). Let the matrix $A$ be SPD and consider the decomposition

$$A = (\widetilde{L} + \widetilde{R})\left(\widetilde{L} + \widetilde{R}\right)^T - E.$$

Here the incomplete factor $\widetilde{L}$ is a lower triangular matrix with positive diagonal entries, $\widetilde{R}$ is a strictly lower triangular matrix with "small" entries, and the error matrix is $E = \widetilde{R}\widetilde{R}^T$. At each step, the next column of $\widetilde{L}$ is computed, and then the remaining Schur complement is modified. On step $j$ of the incomplete factorization, the first column of the Schur complement $S^{(j)}$ is split into the sum

$$
\widetilde{L}\_{j:n,j} + \widetilde{R}\_{j:n,j},
$$

where $\widetilde{L}\_{j:n,j}$ contains the entries that are retained in column $j$ of the final incomplete factorization, $(\widetilde{R})\_{jj} = 0$ and $\widetilde{R}\_{j+1:n,j}$ contains the entries that are discarded. If a complete factorization were being computed, then the Schur complement would be updated by subtracting

$$(\widetilde{L}\_{j+1:n,j} + \widetilde{R}\_{j+1:n,j}) \left(\widetilde{L}\_{j+1:n,j} + \widetilde{R}\_{j+1:n,j}\right)^T.$$

However, the incomplete factorization discards the term

$$E^{(j)} = \widetilde{R}\_{j+1:n,j} \, \widetilde{R}\_{j+1:n,j}^T.$$




**Figure 10.3** An illustration of the fill-in in a standard sparsification-based IC factorization (left) and in the approach that uses intermediate memory (right) after one step of the factorization. Entries with a small absolute value in row and column 1 are denoted by $\delta$. The filled entries are denoted by $f$.

**Figure 10.4** On the left is an SPD matrix with an entry of small absolute value in positions $(1,3)$ and $(3,1)$. In the centre is $\mathcal{S}\{\widetilde{L}\}$ computed using a standard IC factorization that drops the small entry $\delta$ at position $(3,1)$ (there are no filled entries in this case). On the right is the lower triangular part of the elimination matrix after the first step of the incomplete factorization using intermediate memory. The filled entry is denoted by $f$.

Thus, the matrix $E^{(j)}$ is implicitly added to *A*, and because $E^{(j)}$ is positive semidefinite, the approach is naturally breakdown-free.


The obvious choice for $\widetilde{R}\_{j+1:n,j}$ is the smallest off-diagonal entries in the column (those that are smaller in absolute value than a chosen tolerance). Then implicitly adding $E^{(j)}$ is combined with the standard steps of an IC factorization, with entries dropped from $\widetilde{L}$ after the updates have been applied to the Schur complement.

Figure 10.3 depicts the first step of this approach. In the first row and column, ∗ and *δ* denote the entries of $\widetilde{L}\_{1:n,1}$ and $\widetilde{R}\_{1:n,1}$, respectively. Because a standard sparsification scheme does not store the smallest entries, using such a scheme gives no fill-in in the rows and columns corresponding to the discarded entries; this is shown on the left. The fill-in in the factorization that uses intermediate memory is depicted on the right. Clearly, more filled entries are used in constructing $\widetilde{L}$.

This strategy enables the structure of the complete factorization to be followed more closely than is possible using a standard approach. This is illustrated in Figure 10.4. If the small entries at positions (1, 3) and (3, 1) are not discarded, then there is a filled entry in position (3, 2) and this allows the incomplete factorization using intermediate memory to involve the (large) off-diagonal entries in positions (5, 2) and (6, 2) in the second step of the IC factorization.

Unfortunately, because the column $\widetilde{R}\_{j+1:n,j}$ must be retained to perform the updates on the next step, the total memory requirements are essentially as for a

#### **ALGORITHM 10.7 Crout memory-limited IC factorization**

**Input:** SPD matrix *A*, memory control parameters *lsize >* 0 and *rsize* ≥ 0.

**Output:** Incomplete Cholesky factorization $A \approx \widetilde{L}\widetilde{L}^T$.



```
w_i = 0, 1 ≤ i ≤ n
for j = 1 : n do
    for i = j : n such that a_ij ≠ 0 do
        w_i = a_ij
    end for
    for k < j such that l̃_jk ≠ 0 do
        for i = j : n such that l̃_ik ≠ 0 do
            w_i = w_i − l̃_ik l̃_jk
        end for
        for i = j : n such that r̃_ik ≠ 0 do
            w_i = w_i − r̃_ik l̃_jk
        end for
    end for
    for k < j such that r̃_jk ≠ 0 do
        for i = j : n such that l̃_ik ≠ 0 do
            w_i = w_i − l̃_ik r̃_jk
        end for
    end for
    Copy into L̃_{j:n,j} the lsize + nz(A_{j:n,j}) entries of w of largest absolute value
    Copy into R̃_{j+1:n,j} the rsize entries of w that are the next largest in absolute value
    Scale l̃_jj = (w_j)^{1/2}, L̃_{j+1:n,j} = L̃_{j+1:n,j} / l̃_jj, R̃_{j+1:n,j} = R̃_{j+1:n,j} / l̃_jj
    Reset entries of w to zero
end for
Optionally discard R̃        ▷ R̃ is often only used in the construction of L̃
```
complete factorization. However, the memory can be reduced by introducing two drop tolerances so that only entries of absolute value at least $\tau\_1$ are kept in $\widetilde{L}$ and entries smaller than $\tau\_2$ are dropped from $\widetilde{R}$. The factorization is then no longer guaranteed to be breakdown-free. Furthermore, the numbers of entries in $\widetilde{L}$ and $\widetilde{R}$ are not known a priori.


An alternative idea that limits both the number of entries in the incomplete factor and the intermediate memory is to fix the maximum number of entries in each column of $\widetilde{L}$ and $\widetilde{R}$. This is outlined in Algorithm 10.7. Here *lsize* ≥ 0 and *rsize* ≥ 0 are the maximum number of filled entries in each column of $\widetilde{L}$ and the maximum number of entries in each column of $\widetilde{R}$, respectively, and $nz(A\_{j:n,j})$ denotes the number of entries in the lower triangular part of column *j* of *A*. The number of entries in $\widetilde{L}$ is less than $nz(A) + (n-1)\,lsize$ (where $nz(A)$ is the number of entries in the lower triangular part of *A*) and $\widetilde{R}$ has at most $(n-1)\,rsize$ entries. If the parameter *rsize* is set to 0, then no intermediate memory is used, but in general choosing *rsize >* 0 leads to the computed $\widetilde{L}$ being a higher quality preconditioner. In case of breakdown, the algorithm can incorporate the use of a global shift; see Algorithm 9.1.
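The following Python sketch mirrors the structure of Algorithm 10.7 on a small dense matrix. It is an illustration under simplifying assumptions and not the HSL implementation: dense lists replace sparse storage, the diagonal entry is always kept and is counted as one of the *lsize* + *nz* slots of its column, breakdown simply raises an error instead of triggering a global shift, and the name `memory_limited_ic` is hypothetical.

```python
import math

def memory_limited_ic(A, lsize, rsize):
    """Sketch of a Crout-style memory-limited IC factorization (dense storage).

    A is a symmetric positive definite matrix given as a list of lists.
    Returns dense lower triangular L and strictly lower triangular R.
    """
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Load the lower triangular part of column j of A into w.
        w = [float(A[i][j]) if i >= j else 0.0 for i in range(n)]
        for k in range(j):
            ljk, rjk = L[j][k], R[j][k]
            if ljk != 0.0:
                # Contributions of the L*L^T and R*L^T terms.
                for i in range(j, n):
                    w[i] -= (L[i][k] + R[i][k]) * ljk
            if rjk != 0.0:
                # Contribution of the L*R^T term; R*R^T is never formed.
                for i in range(j, n):
                    w[i] -= L[i][k] * rjk
        if w[j] <= 0.0:
            raise ArithmeticError(f"breakdown in column {j}")
        # Rank the off-diagonal entries of w by absolute value.
        off = sorted((i for i in range(j + 1, n) if w[i] != 0.0),
                     key=lambda i: abs(w[i]), reverse=True)
        nz_col = sum(1 for i in range(j, n) if A[i][j] != 0.0)
        keep_l = off[:lsize + nz_col - 1]          # the diagonal uses one slot
        keep_r = off[len(keep_l):len(keep_l) + rsize]
        d = math.sqrt(w[j])
        L[j][j] = d
        for i in keep_l:
            L[i][j] = w[i] / d
        for i in keep_r:
            R[i][j] = w[i] / d
    return L, R

# Small SPD example (diagonally dominant).
A = [[4.0, 1.0, 0.0, 1.0],
     [1.0, 4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [1.0, 0.0, 1.0, 4.0]]
L, R = memory_limited_ic(A, lsize=0, rsize=1)
```

With *lsize* large enough to keep every entry and *rsize* = 0, the sketch reduces to a complete Cholesky factorization, which gives a simple way to check it.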


### **10.7 Fixed-Point Iterations for Computing ILU Factorizations**

The fixed-point ILU algorithm is fundamentally different from Gaussian elimination-based approaches. Given the target sparsity pattern $\mathcal{S}\{\widetilde{L} + \widetilde{U}\}$, the goal is to iteratively generate incomplete factors fulfilling the ILU property

$$(\widetilde{L}\widetilde{U})\_{ij} = a\_{ij}, \quad (i,j) \in \mathcal{S}\{\widetilde{L} + \widetilde{U}\}$$

(see Theorem 10.1). The idea is appealing because the entries of $\widetilde{L}$ and $\widetilde{U}$ can be computed iteratively in parallel using the constraints

$$\sum\_{\substack{k=1\\(i,k),(k,j)\in\mathcal{S}\{\widetilde{L}+\widetilde{U}\}}}^{\min(i,j)} \tilde{l}\_{ik}\tilde{u}\_{kj} = a\_{ij}, \quad (i,j) \in \mathcal{S}\{\widetilde{L}+\widetilde{U}\},$$

and the normalization $\tilde{l}\_{ii} = 1$. Using the relations

$$\tilde{l}\_{ij} = \left( a\_{ij} - \sum\_{k=1}^{j-1} \tilde{l}\_{ik} \tilde{u}\_{kj} \right) / \tilde{u}\_{jj}, \quad i > j,\tag{10.2}$$

$$\tilde{u}\_{ij} = a\_{ij} - \sum\_{k=1}^{i-1} \tilde{l}\_{ik} \tilde{u}\_{kj}, \quad i \le j,\tag{10.3}$$

the approach can be formulated as a fixed-point iteration method of the form $w\_{k+1} = f(w\_k)$, $k = 0, 1, \ldots$, where *w* is a vector containing the unknowns $\tilde{l}\_{ij}$ and $\tilde{u}\_{ij}$. Each fixed-point iteration is called a **sweep**. Algorithm 10.8 outlines the method.

An important question is how to choose initial values for the factor entries. In some applications, a natural initial guess is available. For example, in time-dependent problems, the $\widetilde{L}$ and $\widetilde{U}$ computed in the previous time step may provide appropriate initial guesses for the current time step. In other cases, a possible strategy is to symmetrically scale *A* to have a unit diagonal and then take the initial $\widetilde{L}$



#### **ALGORITHM 10.8 Fixed-point ILU factorization**

**Input:** Matrix *A*, the target sparsity pattern $\mathcal{S}\{\widetilde{L} + \widetilde{U}\}$, and initial incomplete factors $\widetilde{L}$ and $\widetilde{U}$.



**Output:** Updated incomplete factors.

**for** $(i, j) \in \mathcal{S}\{\widetilde{L} + \widetilde{U}\}$ **do** Set $\tilde{l}\_{ij}$ and $\tilde{u}\_{ij}$ to the given initial values **end for**

**for** *sweep* = 1, 2, ... **do**

**for** $(i, j) \in \mathcal{S}\{\widetilde{L} + \widetilde{U}\}$ **do**

**if** *i > j* **then** Compute $\tilde{l}\_{ij}$ using (10.2) **else** Compute $\tilde{u}\_{ij}$ using (10.3) **end if**

**end for**

**end for**

and $\widetilde{U}$ to be the lower and upper parts of the scaled matrix, respectively. In practice, a few sweeps may be sufficient to generate preconditioners that are competitive in quality with those generated via classical incomplete Gaussian elimination algorithms.
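As a rough illustration of Algorithm 10.8, the sketch below performs Jacobi-style sweeps in which every entry is updated from the values of the previous sweep, so that all updates within a sweep are independent. The assumptions are mine: dense storage, a pattern that contains the diagonal, no prior scaling of *A*, initial factors taken from the lower and upper parts of *A*, and the hypothetical name `fixed_point_ilu`.

```python
def fixed_point_ilu(A, pattern, sweeps):
    """Fixed-point ILU sweeps using relations (10.2) and (10.3).

    A is a dense list of lists; pattern is a set of (i, j) pairs that must
    contain every diagonal position. Returns unit lower triangular L and
    upper triangular U (dense, zero outside the pattern).
    """
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        L[i][i] = 1.0                      # normalization l_ii = 1
    for (i, j) in pattern:                 # initial guess from A itself
        if i > j:
            L[i][j] = float(A[i][j])
        else:
            U[i][j] = float(A[i][j])
    for _ in range(sweeps):
        new_L = [row[:] for row in L]
        new_U = [row[:] for row in U]
        for (i, j) in pattern:             # independent updates
            s = sum(L[i][k] * U[k][j] for k in range(min(i, j)))
            if i > j:
                new_L[i][j] = (A[i][j] - s) / U[j][j]   # relation (10.2)
            else:
                new_U[i][j] = A[i][j] - s               # relation (10.3)
        L, U = new_L, new_U
    return L, U

# Tridiagonal test matrix; its ILU(0) pattern coincides with the exact LU.
n = 5
A = [[4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0)
      for j in range(n)] for i in range(n)]
pattern = {(i, j) for i in range(n) for j in range(n) if A[i][j] != 0.0}
L3, U3 = fixed_point_ilu(A, pattern, sweeps=3)     # a few sweeps: approximate
L10, U10 = fixed_point_ilu(A, pattern, sweeps=10)  # enough sweeps: converged
```

For this tridiagonal example the dependencies between the unknowns form a single chain, so the iteration reaches the exact factors after a number of sweeps proportional to *n*; an asynchronous or Gauss-Seidel-style update order would typically need fewer.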

The following features differentiate the fixed-point ILU algorithm from classical methods and make it attractive for parallel computations on modern architectures.


To enhance the preconditioner quality, it is possible to interleave Algorithm 10.8 with a strategy that dynamically adapts $\mathcal{S}\{\widetilde{L} + \widetilde{U}\}$ to the problem characteristics. In an iterative process based on highly parallel building blocks, this allows threshold-based ILU factorizations to be computed on parallel shared-memory architectures and enables the efficient use of streaming-based architectures such as GPUs.


#### **10.8 Ordering in Incomplete Factorizations**

Ordering algorithms designed for sparse direct solvers (see Chapter 8) can have a positive effect on the robustness and performance of preconditioned Krylov subspace methods. However, the best choice of ordering for an incomplete factorization preconditioner may not be the same as for a complete factorization, and although the effects of orderings and how much fill-in is allowed have been widely demonstrated, they are not yet fully understood.

When the natural (lexicographic) ordering is used, the incomplete triangular factors resulting from a no-fill ILU factorization can be highly ill-conditioned, even if the matrix *A* is well-conditioned. Allowing more fill-in in the factors, for example, using ILU(1) instead of ILU(0), may solve the problem, but it is not guaranteed. In some cases, preordering *A* can lead to more stable factors, and hence more effective preconditioners, but, again, this is not understood.

Minimum degree orderings (Section 8.1.2) are popular for direct methods, but for incomplete factorizations care is needed to ensure the dropping strategy is compatible with the ordering. This is because the rows (and columns) of the permuted matrix can have significantly different counts. In this situation, using memory-based dropping in which the maximum allowable number of filled entries in a row of *L* is the same for all rows may not be a good approach. An alternative strategy is to specify that the permitted fill-in is proportional to that of the complete factorization (which can be computed using Algorithm 4.3).

A level set ordering that reduces the bandwidth or profile of a matrix can be employed (Section 8.2). For complete factorizations, the fill-in in the factors can be much greater than for nested dissection or minimum degree, but for incomplete factorizations such orderings can be highly effective. In particular, using an RCM ordering (Algorithm 8.3) is often found to lead to a higher quality preconditioner than using the natural ordering. RCM-based orderings are generally inexpensive to compute and can provide good reuse of computer caches.
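To make the bandwidth-reducing effect concrete, here is a minimal sketch of a reverse Cuthill-McKee style ordering on an adjacency-set graph. It is not claimed to match Algorithm 8.3 in detail: as a simplification it starts each component from a vertex of minimum degree instead of searching for a pseudo-peripheral vertex, and the names `rcm_order` and `bandwidth` are hypothetical.

```python
from collections import deque

def rcm_order(adj):
    """Reverse Cuthill-McKee style ordering of a graph {vertex: set(neighbours)}."""
    order, seen = [], set()
    # Simplification: start each component from a minimum degree vertex.
    for start in sorted(adj, key=lambda v: (len(adj[v]), v)):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Visit unseen neighbours in order of increasing degree.
            for u in sorted(adj[v] - seen, key=lambda u: (len(adj[u]), u)):
                seen.add(u)
                queue.append(u)
    return order[::-1]                     # reverse the Cuthill-McKee order

def bandwidth(adj, order):
    """Largest distance |pos(u) - pos(v)| over all edges under the ordering."""
    pos = {v: p for p, v in enumerate(order)}
    return max((abs(pos[u] - pos[v]) for v in adj for u in adj[v]), default=0)

# A path graph whose vertices carry scrambled labels.
path = [3, 0, 6, 1, 7, 2, 5, 4]
adj = {v: set() for v in path}
for a, b in zip(path, path[1:]):
    adj[a].add(b)
    adj[b].add(a)

natural = sorted(adj)
reordered = rcm_order(adj)
print(bandwidth(adj, natural), bandwidth(adj, reordered))
```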

Global orderings based on a divide-and-conquer approach and, in particular, nested dissection (Section 8.4) are important for complete factorizations. But such orderings cut local connections within the graph of *A* and, when used with incomplete factorizations, can lead to poor quality preconditioners. A global ordering that specifically targets incomplete factorizations is a **red–black** (or checker board) ordering. Consider the graph G*(A)* of an SPD matrix *A* that arises from a simple 5-point discretization of a PDE on a regular two-dimensional grid and colour its vertices using two colours so that no vertices of the same colour are incident to the same edge (see Figure 10.5). Because no red vertex is adjacent to any other red vertex, the red vertices are an independent set; similarly, the black vertices are an independent set. The red vertices can be processed in any order, provided they are all processed before any of the black vertices. This can make red–black orderings convenient for parallel implementations and is the main reason that they are often employed with stationary iterative methods.

**Figure 10.5** A model problem to illustrate a red–black ordering. The grid-based graph G*(A)* with coloured vertices is given together with the matrix *A* (left) and the symmetrically permuted matrix using the red–black ordering (right).

A bipartite graph is an undirected graph whose vertices can be partitioned into two disjoint sets such that each set is an independent set (Section 6.3.1). It follows that the red–black ordering exists if and only if G*(A)* is bipartite. The ordering is often generalized as follows. Start by finding a set of mutually non-adjacent vertices (that is, an independent set) and flag them as red vertices. After the elimination of the variables corresponding to the red vertices and employing a sparsification strategy, a Schur complement matrix is obtained. Proceed by finding a set of mutually nonadjacent vertices in this matrix, flag them as red vertices and continue recursively. This approach can lead to a significant decrease in the condition number of the preconditioned matrix. Another generalization for arbitrary graphs is to employ more colours (multicolouring). Again, the colouring can be exploited in parallel computations. For efficiency, load balancing of the coloured vertices needs to be considered. Because reordering the vertices can affect the convergence rate of an iterative solver, the potential gain in parallel performance at each iteration may be offset by a slower convergence rate.
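The red–black ordering for the 5-point model problem can be sketched in a few lines. The code below colours grid point (*i*, *j*) red when *i* + *j* is even, lists the red vertices first, and builds the sparsity pattern of the symmetrically permuted matrix; the function names are hypothetical and the grid numbering (row by row) is an assumption.

```python
def red_black_ordering(nx, ny):
    """Red-black ordering for the 5-point stencil on an nx-by-ny grid.

    Returns (order, n_red, adj): the new vertex order with red vertices
    first, the number of red vertices, and the grid graph as adjacency sets.
    """
    def idx(i, j):
        return i * ny + j                  # row-by-row (natural) numbering

    adj = {idx(i, j): set() for i in range(nx) for j in range(ny)}
    for i in range(nx):
        for j in range(ny):
            for di, dj in ((1, 0), (0, 1)):
                p, q = i + di, j + dj
                if p < nx and q < ny:
                    adj[idx(i, j)].add(idx(p, q))
                    adj[idx(p, q)].add(idx(i, j))
    red = [idx(i, j) for i in range(nx) for j in range(ny) if (i + j) % 2 == 0]
    black = [idx(i, j) for i in range(nx) for j in range(ny) if (i + j) % 2 == 1]
    return red + black, len(red), adj

def permuted_pattern(adj, order):
    """0/1 sparsity pattern of the symmetrically permuted matrix."""
    pos = {v: p for p, v in enumerate(order)}
    n = len(order)
    P = [[0] * n for _ in range(n)]
    for v in adj:
        P[pos[v]][pos[v]] = 1
        for u in adj[v]:
            P[pos[v]][pos[u]] = 1
    return P

order, n_red, adj = red_black_ordering(3, 3)
P = permuted_pattern(adj, order)
for row in P:
    print("".join("*" if x else "." for x in row))
```

In the permuted pattern both diagonal blocks are diagonal matrices, which reflects the fact that the red vertices, and likewise the black vertices, form an independent set.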

#### **10.9 Exploiting Block Structure**

Blocking methods for complete factorizations can be adapted to incomplete factorizations. The aim is to speed up the computation of the factors and to obtain more effective preconditioners. In a block factorization, scalar operations of the form

$$\tilde{l}\_{ik} = a\_{ik} / \tilde{u}\_{kk}$$

are replaced by matrix operations

$$
\widetilde{L}\_{ib,kb} = A\_{ib,kb} \widetilde{U}\_{kb,kb}^{-1},
$$



and scalar multiplications of entries of the factors are replaced by matrix–matrix products. When dropping entries, instead of considering the absolute values, simple norms of the block entries (such as the one norm, max norm, or Frobenius norm) are used.
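Block-wise dropping can be illustrated as follows. This sketch, which assumes a uniform block size and uses the Frobenius norm, zeroes every off-diagonal block whose norm falls below a tolerance; the name `drop_small_blocks` and the choice to protect diagonal blocks are illustrative assumptions.

```python
import math

def drop_small_blocks(M, block, tol):
    """Return a copy of M with small off-diagonal blocks set to zero.

    A block is dropped as a whole when its Frobenius norm is below tol;
    diagonal blocks are always kept.
    """
    n = len(M)
    out = [row[:] for row in M]
    for ib in range(0, n, block):
        for jb in range(0, n, block):
            if ib == jb:
                continue
            rows = range(ib, min(ib + block, n))
            cols = range(jb, min(jb + block, n))
            fro = math.sqrt(sum(M[i][j] ** 2 for i in rows for j in cols))
            if fro < tol:
                for i in rows:
                    for j in cols:
                        out[i][j] = 0.0
    return out

M = [[4.0, 1.0, 0.01, 0.0],
     [1.0, 4.0, 0.0, 0.02],
     [0.01, 0.0, 4.0, 1.0],
     [0.0, 0.02, 1.0, 4.0]]
M_dropped = drop_small_blocks(M, block=2, tol=0.1)
```

Dropping by blocks keeps the dense block structure intact, so the surviving blocks can still be processed with matrix–matrix kernels.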

An incomplete factorization can start with the supernodal structure of the complete factors. If dropping is applied to individual columns, this structure is generally lost. To try and retain it, the dropping strategy can be modified either to drop the set of nonzeros of a row in the current supernode or to keep it. To obtain sufficiently sparse incomplete factors, it may be necessary to subdivide each supernode, allowing greater flexibility on how many rows are dropped. It is also possible to relax blocking operations in such a way that the supernodes are not exact but are allowed to incur some fill-in.

#### **10.10 Notes and References**

Sparsity structure was the main ingredient of the first algebraic preconditioners that were developed in the late 1950s. The nonzero structure represented the stencils resulting from the discretization of PDEs on structured grids. The earliest contribution is Buleev (1959), and this was later generalized to three-dimensional problems. An independent derivation and its interpretation as an incomplete factorization for a sparse matrix coming from a simple 5-point stencil is given in Varga (1960); other early work is by Baker & Oliphant (1960). For an overview of early contributions and the motivations behind incomplete factorizations, see Il'in (1992); we also refer to the survey of Chan & van der Vorst (1997).

Important breakthroughs in the use of preconditioning using incomplete factorizations for practical problems came in two key papers. The first by Meijerink & van der Vorst (1977) recognized the importance of preconditioning for the conjugate gradient method. In the second, Kershaw (1978) proposed locally replacing pivots by a small positive number to prevent breakdown of the factorization. This paved the way for incomplete factorizations in which dropping is based solely on the size of the computed entries and which were introduced even earlier by Tuff & Jennings (1973).

The Crout incomplete LU factorization outlined in Algorithm 10.1 was implemented in a successful code for symmetric problems by Lin & Moré (1999), building on earlier ideas of Jones & Plassmann (1995) and Eisenstat et al. (1982) (see also Li et al., 2003 for later contributions to this approach). Algorithm 10.2 with a sparsification strategy that uses both a drop tolerance and a limit on the number of entries in each column of the incomplete factors was published in Saad (1994a) as the dual threshold ILUT method. For general nonsymmetric matrices, ILUT has proved very popular and has been developed further (see, for example, MacLachlan et al., 2012). But because it is based on the row factorization, it ignores symmetry in *A* and, if *A* is symmetric, the computed sparsity patterns of *L* and *U<sup>T</sup>* are normally different. In this case, a Crout incomplete factorization may be preferable. The hierarchy of sparsity structures based on the concept of levels is introduced in Watts-III (1981). The initial work has since been significantly improved, notably for parallel implementations by Hysom & Pothen (2002). The Euclid library is a scalable implementation of a parallel level-based ILU algorithm that is available as part of the *hypre* library of linear solvers (see Falgout et al., 2006, 2021). Scalable means that the incomplete factorization and triangular solve timings remain nearly constant when the problem size *n* is scaled in proportion to the number of processors. Another parallel level-based ILU preconditioner that uses an adaptive block implementation is proposed in Hénon et al. (2008).

The modified incomplete factorizations of Section 10.4 are described in Saad (2003b). A proof of Theorem 10.3 can be found in Bern et al. (2006), but it is also of interest to follow earlier work on asymptotic bounds for the condition number of matrices preconditioned by modified incomplete factorizations given in Dupont et al. (1968), Axelsson (1972), and Gustafsson (1978), while an elegant description is in Meurant (1999).

Incomplete factorizations with dynamic compensation originally introduced by Ajiz & Jennings (1984) have been routinely employed in practice. However, memory-limited approaches based on relaxing the strategy of Tismenetsky (1991) often lead to more efficient preconditioners; see Kaporin (1998) for a row-based construction that has recently been used by Konshin et al. (2017, 2019) to solve challenging practical problems. Scott & Tůma (2014b) present a Crout construction of a sophisticated memory-limited incomplete factorization and provide a robust implementation for SPD systems as the package HSL\_MI28 within the HSL mathematical software library (Scott & Tůma, 2014a); a variant for symmetric saddle point systems is also included in HSL.

Using fixed-point iterations for the parallel computation of incomplete factorizations is a relatively new idea that was proposed and analysed by Chow & Patel (2015). Interleaving a fixed-point iteration with a procedure that adjusts the sparsity pattern is proposed by Anzt et al. (2018). Other attempts to compute and use ILU preconditioners in parallel that build on the software package ILUPACK (Bollhöfer et al., 2012) are presented in Aliaga et al. (2016, 2019). A different approach to parallelize incomplete factorizations by relaxing supernodes is given by Gupta & George (2010).

Significant attention has been devoted to using orderings of *A* to try and improve the quality of incomplete factorization preconditioners. An early and often quoted comparison of reorderings for SPD problems is by Duff & Meurant (1989). For more general matrices, see Benzi et al. (1999), Oliker et al. (2002), or Osei-Kuffuor et al. (2015). Saad (1996a) and Saad & Zhang (1999) generalize red–black orderings and consider blocks and/or more colours; also of interest are the papers of Saad & Suchomel (2002), Li et al. (2003), and Carpentieri et al. (2014).

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **Chapter 11 Sparse Approximate Inverse Preconditioners**

*While it is recognized that preconditioning the system often improves the convergence of a particular method, this is not always so. In particular, a successful preconditioner for one class of problems may prove ineffective on another class. – Gould & Scott (1998).*

*There is, of course, no such concept as a best preconditioner ... However, every practitioner knows when they have a good preconditioner which enables feasible computation and solution of problems. In this sense, preconditioning will always be an art rather than a science. – Wathen (2015).*

Consider a preconditioner *M* based on an incomplete LU (or Cholesky) factorization of a matrix *A*. The inverse $M^{-1}$, which represents an approximation of $A^{-1}$, is applied by performing forward and back substitution steps; this can present a computational bottleneck. An alternative strategy is to directly approximate $A^{-1}$ by explicitly computing $M^{-1}$. Preconditioners of this kind are called **sparse approximate inverse** preconditioners. They constitute an important class of algebraic preconditioners that are complementary to the approaches discussed in the previous chapter. They can be attractive because when used with an iterative solver, they can require fewer iterations than standard incomplete factorization preconditioners that contain a similar number of entries while offering significantly greater potential for parallel computations.

From Theorem 7.3, the sparsity pattern of the inverse of an irreducible matrix *A* is dense, even when *A* is sparse. Therefore, if *A* is large, the exact computation of its inverse is not an option, and aggressive dropping is needed to obtain a sufficiently sparse approximation to $A^{-1}$ that can be used as a preconditioner. Fortunately, for a wide class of problems of practical interest, many of the entries of $A^{-1}$ are small in absolute value, so that approximating the inverse with a sparse $M^{-1}$ may be feasible, although capturing the large (important) values of $A^{-1}$ is a nontrivial task. Importantly, the computed $M^{-1}$ can have nonzeros at positions that cannot be obtained by either a complete or an incomplete factorization, and this can be beneficial. Furthermore, although $A^{-1}$ is fully dense, the following result shows this is not the case for the factors of factorized inverses.

**Theorem 11.1 (Bridson & Tang 1999; Benzi & Tůma 2000)** *Assume the matrix A is SPD, and let L be its Cholesky factor. Then* $\mathcal{S}\{L^{-1}\}$ *is the union of all entries (i, j) such that i is an ancestor of j in the elimination tree* $\mathcal{T}(A)$*.*

A consequence of this result is that $L^{-1}$ need not be fully dense. Considering this implication algorithmically, if *A* is SPD, it may be advantageous to preorder *A* to limit the number of ancestors that the vertices in $\mathcal{T}(A)$ have. For example, nested dissection may be applied to $\mathcal{S}\{A\}$ (Section 8.4). If $\mathcal{S}\{A\}$ is nonsymmetric, then it may be possible to reduce fill-in in the factors of $A^{-1}$ by applying nested dissection to $\mathcal{S}\{A + A^T\}$.

#### **11.1 Basic Approaches**


An obvious way to obtain an approximate inverse of *A* in factorized form is to compute an incomplete LU factorization of *A* and then perform an approximate inversion of the incomplete factors. For example, if incomplete factors $\widetilde{L}$ and $\widetilde{U}$ are available, approximate inverse factors can be found by solving the 2*n* triangular linear systems

$$
\widetilde{L} x\_i = e\_i, \quad \widetilde{U} y\_i = e\_i, \quad 1 \le i \le n,
$$

where *ei* is the *i*-th column of the identity matrix. These systems can all be solved independently, and hence, there is the potential for significant parallelism. To reduce costs and to preserve sparsity in the approximate inverse factors, they may not need to be solved accurately. A disadvantage is that the computation of the preconditioner involves two levels of incompleteness, and because information from the incomplete factorization of *A* is passed into the second step, the loss of information can be excessive.
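The inexact triangular solves described above can be sketched as a forward substitution that discards small computed entries. The drop rule used here (an absolute tolerance applied to each entry as soon as it is computed) is only one possible choice and is an assumption, as is the name `approx_inverse_lower`; dense storage is used for brevity.

```python
def approx_inverse_lower(L, tol):
    """Approximate inverse of a lower triangular matrix, column by column.

    Column i approximately solves L x_i = e_i by forward substitution;
    off-diagonal entries with absolute value <= tol are dropped as they
    are computed. Returns X as a dense list of lists.
    """
    n = len(L)
    X = [[0.0] * n for _ in range(n)]
    for i in range(n):                     # the columns are independent
        x = [0.0] * n
        x[i] = 1.0 / L[i][i]
        for r in range(i + 1, n):
            s = sum(L[r][c] * x[c] for c in range(i, r))
            val = -s / L[r][r]
            x[r] = val if abs(val) > tol else 0.0
        for r in range(n):
            X[r][i] = x[r]
    return X

L = [[2.0, 0.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.1, 1.0, 2.0]]
X_exact = approx_inverse_lower(L, tol=0.0)    # no dropping: exact inverse
X_sparse = approx_inverse_lower(L, tol=0.2)   # dropping: sparser approximation
```

Because an entry that has been dropped no longer contributes to later entries of the same column, the error introduced by dropping propagates down the column, which is one way in which the second level of incompleteness shows up.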

Another straightforward approach is based on bordering. Let $A\_j$ denote the principal leading submatrix of *A* of order *j* ($A\_j = A\_{1:j,1:j}$), and assume that its inverse factorization

$$A\_j^{-1} = W\_j D\_j^{-1} Z\_j^T$$

is known. Here $W\_j$ and $Z\_j$ are unit upper triangular matrices, and $D\_j$ is a diagonal matrix. Consider the following scheme:

$$
\begin{pmatrix} Z\_j^T & 0 \\ z\_{j+1}^T & 1 \end{pmatrix} \begin{pmatrix} A\_j & A\_{1:j,j+1} \\ A\_{j+1,1:j} & a\_{j+1,j+1} \end{pmatrix} \begin{pmatrix} W\_j & w\_{j+1} \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} D\_j & 0 \\ 0 & d\_{j+1,j+1} \end{pmatrix},
$$

where for 1 ≤ *j<n*

$$\begin{aligned} w\_{j+1} &= -W\_j D\_j^{-1} Z\_j^T A\_{1:j,j+1}, \\\\ z\_{j+1} &= -Z\_j D\_j^{-1} W\_j^T A\_{j+1,1:j}^T, \\\\ d\_{j+1,j+1} &= a\_{j+1,j+1} + z\_{j+1}^T A\_j w\_{j+1} + A\_{j+1,1:j} w\_{j+1} + z\_{j+1}^T A\_{1:j,j+1}. \end{aligned}$$

Starting from *j* = 1, this suggests a procedure for computing the inverse factors of *A*. Sparsity can be preserved by dropping some entries from the vectors $w\_{j+1}$ and $z\_{j+1}$ once they have been computed. Sparsity and the quality of the preconditioner can be influenced by preordering *A*.

If *A* is symmetric, *W* = *Z* and the required work is halved. Furthermore, if *A* is SPD, then it can be shown that, in exact arithmetic, $d\_{jj} > 0$ for all *j* and the process does not break down. In the general case, diagonal modifications may be required, which can limit the effectiveness of the resulting preconditioner.

Observe that the computations of *Z* and *W* are tightly coupled, restricting the potential to exploit parallelism. At each step *j* , besides a matrix–vector product with *Aj* , four sparse matrix–vector products involving *Wj* , *Zj* and their transposes are needed; these account for most of the work. The implementation is simplified if access to the triangular factors is available by columns as well as by rows.
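The bordering recurrences can be written out directly. The sketch below uses dense storage and zero-based indices, applies an optional absolute drop tolerance to the new vectors before the pivot is formed (an assumption about where dropping takes place), and performs no diagonal modification; the name `bordering_inverse_factors` is hypothetical.

```python
def bordering_inverse_factors(A, tol=0.0):
    """Inverse factors by bordering, so that Z^T A W is (nearly) diagonal.

    Returns unit upper triangular W and Z (dense) and the list d holding
    the diagonal of D. With tol = 0 the factors are exact.
    """
    n = len(A)
    W = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    Z = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    d[0] = A[0][0]
    for j in range(1, n):                  # border the leading j-by-j block
        col = [A[i][j] for i in range(j)]  # new column above the diagonal
        row = [A[j][i] for i in range(j)]  # new row left of the diagonal
        # w = -W D^{-1} Z^T col
        t = [sum(Z[i][k] * col[i] for i in range(j)) / d[k] for k in range(j)]
        w = [-sum(W[i][k] * t[k] for k in range(j)) for i in range(j)]
        # z = -Z D^{-1} W^T row^T
        s = [sum(W[i][k] * row[i] for i in range(j)) / d[k] for k in range(j)]
        z = [-sum(Z[i][k] * s[k] for k in range(j)) for i in range(j)]
        # Optional sparsification of the new vectors.
        w = [v if abs(v) > tol else 0.0 for v in w]
        z = [v if abs(v) > tol else 0.0 for v in z]
        # Pivot: a + z^T A_j w + row * w + z^T col
        Aw = [sum(A[i][k] * w[k] for k in range(j)) for i in range(j)]
        d[j] = (A[j][j] + sum(z[i] * Aw[i] for i in range(j))
                + sum(row[i] * w[i] for i in range(j))
                + sum(z[i] * col[i] for i in range(j)))
        for i in range(j):
            W[i][j] = w[i]
            Z[i][j] = z[i]
    return W, Z, d

A = [[4.0, 1.0, 0.0],
     [2.0, 5.0, 1.0],
     [0.0, 1.0, 3.0]]
W, Z, d = bordering_inverse_factors(A)
```

Each step uses both $W\_j$ and $Z\_j$ to form the new vectors, which makes the coupling between the two factors explicit.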

### **11.2 Approximate Inverses Based on Frobenius Norm Minimization**

It is clear from the above discussion that alternative techniques for constructing sparse approximate inverse preconditioners are needed. We start by looking at schemes based on Frobenius norm minimization. Historically, these were the first to be proposed and offer the greatest potential for parallelism because both the construction of the preconditioner and its subsequent application can be performed in parallel.

#### *11.2.1 SPAI Preconditioner*

To describe the sparse approximate inverse (SPAI) preconditioner, it is convenient to use the notation $K = M^{-1}$. The basic idea is to compute $K \approx A^{-1}$ with its columns denoted by $k\_j$ as the solution of the problem of minimizing

$$\|I - AM^{-1}\|\_F^2 = \|I - AK\|\_F^2 = \sum\_{j=1}^n \|e\_j - Ak\_j\|\_2^2,\tag{11.1}$$

over all *K* with pattern $\mathcal{S}$. This produces a right approximate inverse. A left approximate inverse can be computed by solving a minimization problem for $\|I - KA\|\_F = \|I - A^T K^T\|\_F$. This amounts to computing a right approximate inverse for $A^T$ and taking the transpose of the resulting matrix. For nonsymmetric matrices, the distinction between left and right approximate inverses can be important. Indeed, there are situations where it is difficult to compute a good right approximate inverse but easy to find a good left approximate inverse (or vice versa). In the following discussion, we assume that a right approximate inverse is being computed.

The Frobenius norm is generally used because the minimization problem then reduces to least squares problems for the columns of *K* that can be computed independently and, if required, in parallel. Further, these least squares problems are all of small dimension when $\mathcal{S}$ is chosen to ensure *K* is sparse. Let $\mathcal{J} = \{i \mid k\_j(i) \neq 0\}$ be the set of indices of the nonzero entries in column $k\_j$. The set of indices of rows of *A* that can affect a product with column $k\_j$ is $\mathcal{I} = \{m \mid A\_{m,\mathcal{J}} \neq 0\}$. Let $|\mathcal{I}|$ and $|\mathcal{J}|$ denote the number of entries in $\mathcal{I}$ and $\mathcal{J}$, respectively, and let $\widehat{e}\_j = e\_j(\mathcal{I})$ be the vector of length $|\mathcal{I}|$ that is obtained by taking the entries of $e\_j$ with row indices belonging to $\mathcal{I}$. To solve (11.1) for $k\_j$, construct the $|\mathcal{I}| \times |\mathcal{J}|$ matrix $\widehat{A} = A\_{\mathcal{I},\mathcal{J}}$ and solve the small unconstrained least squares problem

$$\min\_{\widehat{k\_j}} \|\widehat{e\_j} - \widehat{A}\widehat{k\_j}\|\_2^2. \tag{11.2}$$
 
This can be done using a dense QR factorization of $\widehat{A}$. Extending $\widehat{k}\_j$ to have length *n* by setting entries that are not in $\mathcal{J}$ to zero gives $k\_j$.
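For a fixed column pattern, the computation of one SPAI column can be sketched as below. The least squares problem is solved with a small modified Gram-Schmidt QR factorization written out in plain Python (an assumption made to keep the example self-contained; a Householder-based dense QR would be the more usual choice), indices are zero-based, and the function names are hypothetical.

```python
import math

def least_squares_qr(M, b):
    """Solve min ||b - M x||_2 via modified Gram-Schmidt QR (full column rank)."""
    m, n = len(M), len(M[0])
    Q = [[M[i][j] for i in range(m)] for j in range(n)]   # Q[j] is column j
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(j):
            R[k][j] = sum(Q[k][i] * Q[j][i] for i in range(m))
            Q[j] = [Q[j][i] - R[k][j] * Q[k][i] for i in range(m)]
        R[j][j] = math.sqrt(sum(v * v for v in Q[j]))
        Q[j] = [v / R[j][j] for v in Q[j]]
    y = [sum(Q[j][i] * b[i] for i in range(m)) for j in range(n)]
    x = [0.0] * n
    for j in reversed(range(n)):           # back substitution with R
        x[j] = (y[j] - sum(R[j][k] * x[k] for k in range(j + 1, n))) / R[j][j]
    return x

def spai_column(A, j, J):
    """Column j of a right approximate inverse with prescribed pattern J."""
    n = len(A)
    J = sorted(J)
    # Rows of A that can contribute to the product A[:, J] * k_hat.
    I = [m for m in range(n) if any(A[m][c] != 0.0 for c in J)]
    A_hat = [[A[m][c] for c in J] for m in I]
    e_hat = [1.0 if m == j else 0.0 for m in I]
    k_hat = least_squares_qr(A_hat, e_hat)
    k = [0.0] * n                          # extend k_hat by zeros
    for c, v in zip(J, k_hat):
        k[c] = v
    r = [(1.0 if m == j else 0.0) - sum(A[m][c] * k[c] for c in J)
         for m in range(n)]
    return k, r

# Tridiagonal example; the pattern {0, 1} is the zero-based analogue of {1, 2}.
A = [[2.0, -1.0, 0.0, 0.0],
     [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0],
     [0.0, 0.0, -1.0, 2.0]]
k, r = spai_column(A, 0, {0, 1})
```

At the least squares solution the residual is orthogonal to the columns of *A* indexed by the pattern, which is a convenient correctness check.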

A straightforward way to construct $\mathcal{S}$ that does not rely on a sophisticated initial choice (the initial pattern could, for example, be that of the identity or be equal to $\mathcal{S}\{A\}$) proceeds as follows. Starting with a chosen column sparsity pattern $\mathcal{J}$ for $k\_j$, construct $\widehat{A}$, solve (11.2) for $\widehat{k}\_j$, set $k\_j(\mathcal{J}) = \widehat{k}\_j$, and define the residual vector

$$r\_j = e\_j - A\_{1:n, \mathcal{J}} \widehat{k}\_j.$$

If $\|r\_j\|\_2 \neq 0$, then $k\_j$ is not equal to the *j*-th column of $A^{-1}$, and a better approximation can be derived by augmenting $\mathcal{J}$. To do this, let $\mathcal{L} = \{l \mid r\_j(l) \neq 0\}$ and define

$$
\widetilde{\mathcal{J}} = \{ i \mid A\_{\mathcal{L},i} \neq 0 \} \backslash \mathcal{J}. \tag{11.3}
$$

These are candidate indices that can be added to $\mathcal{J}$, but as there may be many of them, they need to be chosen to most effectively reduce $\|r\_j\|\_2$. A possible heuristic is to solve for each $i \in \widetilde{\mathcal{J}}$ the minimization problem

$$\min\_{\mu\_i} \left\| r\_j - \mu\_i A e\_i \right\|\_2^2.$$

This has the solution $\mu\_i = r\_j^T A e\_i / \|A e\_i\|\_2^2$ with residual $\|r\_j\|\_2^2 - (r\_j^T A e\_i)^2 / \|A e\_i\|\_2^2$. Indices $i \in \widetilde{\mathcal{J}}$ for which this is small are appended to $\mathcal{J}$. The process can be repeated until either the required accuracy is attained or the maximum number of allowed entries in $\mathcal{J}$ is reached.
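The candidate selection heuristic can be sketched as follows: form the candidate set (11.3) from the nonzero rows of the residual and rank the candidates by the residual norm that would remain after a one-dimensional minimization. Indices are zero-based, the function name `rank_candidates` is hypothetical, and how many of the best-ranked indices to accept is left to the caller.

```python
def rank_candidates(A, r, J):
    """Rank candidate column indices for augmenting the pattern J.

    Returns a list of (index, remaining squared residual) pairs sorted so
    that the most promising candidate comes first.
    """
    n = len(A)
    L_set = [l for l in range(n) if r[l] != 0.0]          # nonzero residual rows
    candidates = [i for i in range(n)
                  if i not in J and any(A[l][i] != 0.0 for l in L_set)]
    r_norm2 = sum(v * v for v in r)
    scores = {}
    for i in candidates:
        col = [A[m][i] for m in range(n)]                 # the column A e_i
        rt_col = sum(r[m] * col[m] for m in range(n))     # r^T A e_i
        scores[i] = r_norm2 - rt_col ** 2 / sum(c * c for c in col)
    return sorted(scores.items(), key=lambda item: item[1])

# Tridiagonal matrix and the residual of its first SPAI column for the
# pattern {0, 1}; the residual values were worked out by hand for this case.
A = [[2.0, -1.0, 0.0, 0.0],
     [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0],
     [0.0, 0.0, -1.0, 2.0]]
r = [1.0 / 14.0, 2.0 / 14.0, 3.0 / 14.0, 0.0]
ranking = rank_candidates(A, r, {0, 1})
best_index = ranking[0][0]
```

In this small example both remaining columns are candidates, and the column adjacent to the current pattern gives the larger reduction of the residual.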

Solving the unconstrained least squares problem (11.2) after extending $\widehat{A}$ to $A_{\mathcal{I} \cup \mathcal{I}',\, \mathcal{J} \cup \mathcal{J}'}$ is typically performed using updating. Assume the QR factorization of $\widehat{A}$ is

$$
\widehat{A} = A_{\mathcal{I}, \mathcal{J}} = Q \begin{pmatrix} R \\ 0 \end{pmatrix} = \begin{pmatrix} Q_1 & Q_2 \end{pmatrix} \begin{pmatrix} R \\ 0 \end{pmatrix},
$$

where $Q_1$ is $|\mathcal{I}| \times |\mathcal{J}|$. Here $Q$ is an orthogonal matrix and $R$ is an upper triangular matrix. The QR factorization of the extended matrix is

$$\begin{aligned} \begin{pmatrix} \widehat{A} & A_{\mathcal{I}, \mathcal{J}'} \\ & A_{\mathcal{I}', \mathcal{J}'} \end{pmatrix} &= \begin{pmatrix} Q & \\ & I \end{pmatrix} \begin{pmatrix} R & Q_1^T A_{\mathcal{I}, \mathcal{J}'} \\ & Q_2^T A_{\mathcal{I}, \mathcal{J}'} \\ & A_{\mathcal{I}', \mathcal{J}'} \end{pmatrix} \\ &= \begin{pmatrix} Q & \\ & I \end{pmatrix} \begin{pmatrix} I & \\ & Q' \end{pmatrix} \begin{pmatrix} R & Q_1^T A_{\mathcal{I}, \mathcal{J}'} \\ & R' \\ & 0 \end{pmatrix}, \end{aligned}$$

where $Q'$ and $R'$ are from the QR factorization of the $(|\mathcal{I}'| + |\mathcal{I}| - |\mathcal{J}|) \times |\mathcal{J}'|$ matrix

$$
\begin{pmatrix} Q_2^T A_{\mathcal{I}, \mathcal{J}'} \\ A_{\mathcal{I}', \mathcal{J}'} \end{pmatrix}.
$$

Factorizing this matrix and updating the trailing QR factorization to get the new $\widehat{k}_j$ is much more efficient than computing the QR factorization of the extended matrix from scratch.
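The block update can be checked numerically. The following is a minimal dense sketch in Python with NumPy, not a sparse implementation: the sizes, the random test blocks, and the names `B`, `C`, `S`, `R_ext` and `Q_big` are assumptions made for illustration only. It factorizes the small stacked matrix, assembles the extended triangular factor, and confirms that the resulting least squares solution agrees with one computed from scratch.

```python
import numpy as np

rng = np.random.default_rng(0)
nI, nJ, nI2, nJ2 = 6, 3, 2, 2               # |I|, |J|, |I'|, |J'| (illustrative sizes)
A_hat = rng.standard_normal((nI, nJ))       # current matrix A_{I,J}
B = rng.standard_normal((nI, nJ2))          # new columns restricted to old rows, A_{I,J'}
C = rng.standard_normal((nI2, nJ2))         # new columns restricted to new rows, A_{I',J'}

# Existing QR factorization of A_hat, split as Q = (Q1 Q2).
Q, R_full = np.linalg.qr(A_hat, mode="complete")
R = R_full[:nJ, :]
Q1, Q2 = Q[:, :nJ], Q[:, nJ:]

# Only this (|I'| + |I| - |J|) x |J'| matrix has to be factorized.
S = np.vstack([Q2.T @ B, C])
Qp, Rp_full = np.linalg.qr(S, mode="complete")
Rp = Rp_full[:nJ2, :]

# Extended triangular factor and the accumulated orthogonal factor.
R_ext = np.block([[R, Q1.T @ B], [np.zeros((nJ2, nJ)), Rp]])
m = nI - nJ + nI2
Q_big = np.block([[Q, np.zeros((nI, nI2))], [np.zeros((nI2, nI)), np.eye(nI2)]]) @ \
        np.block([[np.eye(nJ), np.zeros((nJ, m))], [np.zeros((m, nJ)), Qp]])

# Extended matrix; the new rows are zero in the old columns.
ext = np.block([[A_hat, B], [np.zeros((nI2, nJ)), C]])
pad = np.zeros((nI + nI2 - nJ - nJ2, nJ + nJ2))

# Least squares solution obtained from the updated factors versus from scratch.
b = rng.standard_normal(nI + nI2)
k_upd = np.linalg.solve(R_ext, (Q_big.T @ b)[: nJ + nJ2])
k_ref, *_ = np.linalg.lstsq(ext, b, rcond=None)
```

In a practical code the orthogonal factors would be kept in factored form rather than multiplied out as `Q_big`; the explicit product is used here only to make the check easy to read.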

Construction of the SPAI preconditioner is summarized in Algorithm 11.1. The maximum number of entries $nz_j$ that is permitted in $\widehat{k}_j$ must be at least as large as the number of entries in the initial sparsity pattern $\mathcal{J}_j$. Updating can be used to compute a new $\widehat{k}_j$ for each pass through the while loop; the number of passes is typically small (for example, if a good initial sparsity pattern is available, a single pass may be sufficient).

The example in Figure 11.1 illustrates Algorithm 11.1. Starting with a tridiagonal matrix, it considers the computation of the first column $k_1$ of the approximate inverse matrix $K$. The algorithm starts with $\mathcal{J}_1 = \{1, 2\}$.

#### **ALGORITHM 11.1 SPAI preconditioner (right-looking approach)**

**Input:** Nonsymmetric matrix $A$, a convergence tolerance $\eta > 0$, an initial sparsity pattern $\mathcal{J}_j$ and the maximum number $nz_j$ of permitted entries for column $j$ of $K$ ($1 \le j \le n$).

**Output:** $K \approx A^{-1}$ with columns $k_j$ ($1 \le j \le n$).

1: **for** $j = 1 : n$ **do** ▷ The columns may be computed in parallel
2: &nbsp;&nbsp;Set $\mathcal{J} = \mathcal{J}_j$ and $\mathcal{I} = \{m \mid A(m, \mathcal{J}) \ne 0\}$, $\|r_j\|_2 = \infty$
3: &nbsp;&nbsp;Construct $\widehat{A} = A_{\mathcal{I},\mathcal{J}}$ and solve (11.2) for $\widehat{k}_j$
4: &nbsp;&nbsp;$r_j = e_j - A_{1:n,\mathcal{J}}\,\widehat{k}_j$
5: &nbsp;&nbsp;**while** $|\mathcal{J}| < nz_j$ and $\|r_j\|_2 > \eta$ **do**
6: &nbsp;&nbsp;&nbsp;&nbsp;Construct $\widetilde{\mathcal{J}}$ given by (11.3) ▷ $\widetilde{\mathcal{J}}$ is the candidate set
7: &nbsp;&nbsp;&nbsp;&nbsp;Determine new indices $\mathcal{J}' \subset \widetilde{\mathcal{J}}$ to add to $\mathcal{J}$
8: &nbsp;&nbsp;&nbsp;&nbsp;$\mathcal{I}' = \{m \mid A_{m,\mathcal{J}'} \ne 0\} \setminus \mathcal{I}$
9: &nbsp;&nbsp;&nbsp;&nbsp;$\mathcal{I} = \mathcal{I} \cup \mathcal{I}'$ and $\mathcal{J} = \mathcal{J} \cup \mathcal{J}'$ ▷ Augment the sparsity pattern
10: &nbsp;&nbsp;&nbsp;&nbsp;Construct new $\widehat{A} = A_{\mathcal{I},\mathcal{J}}$ and new $\widehat{k}_j$ ▷ Update the QR factorization
11: &nbsp;&nbsp;&nbsp;&nbsp;$r_j = e_j - A_{1:n,\mathcal{J}}\,\widehat{k}_j$
12: &nbsp;&nbsp;**end while**
13: &nbsp;&nbsp;$k_j(\mathcal{J}) = \widehat{k}_j$ ▷ Extend $\widehat{k}_j$ to $k_j$ by setting entries not in $\mathcal{J}$ to zero
14: **end for**

$$A = \begin{pmatrix} 10 & -2 & & & \\ -1 & 10 & -2 & & \\ & -1 & 10 & -2 & \\ & & -1 & 10 & -2 \\ & & & -1 & 10 \end{pmatrix}, \quad \widehat{A} = \begin{pmatrix} 10 & -2 \\ -1 & 10 \\ & -1 \end{pmatrix}, \quad \widehat{k}_1 = \begin{pmatrix} 0.1020 \\ 0.0101 \end{pmatrix}, \quad r_1 = \begin{pmatrix} 1.03 \times 10^{-4} \\ 1.03 \times 10^{-3} \\ 1.01 \times 10^{-2} \\ 0 \\ 0 \end{pmatrix}.$$

$$\widehat{A} = \begin{pmatrix} 10 & -2 & \\ -1 & 10 & -2 \\ & -1 & 10 \\ & & -1 \end{pmatrix}, \quad \widehat{k}_1 = \begin{pmatrix} 0.1021 \\ 0.0104 \\ 0.0010 \end{pmatrix}, \quad r_1 = \begin{pmatrix} 1.1 \times 10^{-6} \\ 1.1 \times 10^{-5} \\ 1.1 \times 10^{-4} \\ 1.0 \times 10^{-3} \\ 0 \end{pmatrix}, \quad k_1 = \begin{pmatrix} 0.1021 \\ 0.0104 \\ 0.0010 \\ 0 \\ 0 \end{pmatrix}.$$

**Figure 11.1** An illustration of computing the first column of a sparse approximate inverse using the SPAI algorithm with $nz_1 = 3$. On the top line is the initial tridiagonal matrix $A$ followed by the matrix $\widehat{A}$ and the vectors $\widehat{k}_1$ and $r_1$ on the first loop of Algorithm 11.1. The bottom line presents the updated matrix $\widehat{A}$ that is obtained on the second loop by adding the third column of $A$ together with its fourth row, and the corresponding vectors $\widehat{k}_1$ and $r_1$ and, finally, $k_1$. Here the numerical values have been appropriately rounded.

When $A$ is symmetric, there is no guarantee that the computed $K$ will be symmetric. One possibility is to use $(K + K^T)/2$ to approximate $A^{-1}$. The SPAI preconditioner is not sensitive to the ordering of $A$. This has the advantage that $A$ can be partitioned and preordered in whatever way is convenient, for instance, to better suit the needs of a distributed implementation, without worrying about the impact on the subsequent convergence rate of the solver. The disadvantage is that orderings cannot be used to reduce fill-in and/or improve the quality of this approximate inverse. For instance, if $A^{-1}$ has no small entries, SPAI will not find a sparse $K$, and because the inverse of a permutation of $A$ is just a permutation of $A^{-1}$, no permutation of $A$ will change this.
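The computation of a single SPAI column on the tridiagonal example of Figure 11.1 can be sketched in Python with NumPy as follows. This is a dense illustration under stated assumptions rather than a practical implementation: the function name `spai_column` is hypothetical, indices are zero-based, each reduced problem is solved with `numpy.linalg.lstsq` instead of QR updating, the candidate set is assumed to consist of the columns outside the current pattern that have a nonzero in a row where the residual is nonzero, and exactly one index (the one with the smallest one-dimensional residual estimate) is added per pass.

```python
import numpy as np

def spai_column(A, j, J0, nz_max, eta):
    """Return (k_j, r_j) for column j of a sparse approximate inverse (dense sketch)."""
    n = A.shape[0]
    e_j = np.zeros(n)
    e_j[j] = 1.0
    J = list(J0)
    while True:
        # Rows of A that hold a nonzero in at least one of the columns in J.
        I = np.flatnonzero(np.any(A[:, J] != 0, axis=1))
        k_hat, *_ = np.linalg.lstsq(A[np.ix_(I, J)], e_j[I], rcond=None)
        r = e_j - A[:, J] @ k_hat
        if len(J) >= nz_max or np.linalg.norm(r) <= eta:
            break
        # Assumed candidate set: columns not in J that touch a nonzero residual row.
        rows = np.flatnonzero(r)
        cand = [i for i in range(n) if i not in J and np.any(A[rows, i] != 0)]
        if not cand:
            break

        def reduced_residual(i):
            # Squared residual norm after the one-dimensional minimization over mu_i.
            Ae = A[:, i]
            return r @ r - (r @ Ae) ** 2 / (Ae @ Ae)

        J.append(min(cand, key=reduced_residual))
    k = np.zeros(n)
    k[J] = k_hat          # entries outside the final pattern stay zero
    return k, r

# Tridiagonal matrix of Figure 11.1; first column, initial pattern {1, 2} (zero-based {0, 1}).
A = 10 * np.eye(5) - 2 * np.eye(5, k=1) - np.eye(5, k=-1)
k1, r1 = spai_column(A, j=0, J0=[0, 1], nz_max=3, eta=1e-3)
```

With these inputs the second pass adds the third column to the pattern, after which the limit of three entries stops the loop.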

### *11.2.2 FSAI Preconditioner: SPD Case*

We next consider a class of preconditioners based on an incomplete inverse factorization of *A*−1. The factorized sparse approximate inverse (FSAI) preconditioner for an SPD matrix *A* is defined as the product

$$M^{-1} = G^T G,$$

where the sparse lower triangular matrix $G$ is an approximation of the inverse of the (complete) Cholesky factor $L$ of $A$. Theoretically, a FSAI preconditioner is computed by choosing a lower triangular sparsity pattern $\mathcal{S}_L$ and minimizing

$$\|I - GL\|\_F^2 = tr\left[ (I - GL)^T (I - GL) \right],\tag{11.4}$$

over all $G$ with sparsity pattern $\mathcal{S}_L$. Here $tr$ denotes the matrix trace operator (that is, the sum of the entries on the diagonal). The computation of $G$ can be performed without knowing $L$ explicitly. Differentiating (11.4) with respect to the entries of $G$ and setting to zero yields

$$(GLL^T)_{ij} = (GA)_{ij} = (L^T)_{ij} \quad \text{for all} \quad (i,j) \in \mathcal{S}_L. \tag{11.5}$$

Because $L^T$ is an upper triangular matrix while $\mathcal{S}_L$ is a lower triangular pattern, the matrix equation (11.5) can be rewritten as

$$(GA)_{ij} = \begin{cases} 0 & i \neq j, \quad (i, j) \in \mathcal{S}_L, \\ l_{ii} & i = j. \end{cases} \tag{11.6}$$

$G$ is not available from (11.6) because $L$ is unknown. Instead, $\overline{G}$ is computed such that

$$(\overline{G}A)_{ij} = \delta_{i,j} \quad \text{for all} \quad (i,j) \in \mathcal{S}_L,\tag{11.7}$$

where *δi,j* is the Kronecker delta function (*δi,j* = 1 if *i* = *j* and is equal to 0, otherwise). The FSAI factor *G* is then obtained by setting

$$G = D\overline{G},$$

where *D* is a diagonal scaling matrix. An appropriate choice for *D* is

$$D = \left[diag(\overline{G})\right]^{-1/2},\tag{11.8}$$

so that

$$(GAG^T)_{ii} = 1, \quad 1 \le i \le n.$$

The following result shows that the FSAI preconditioner exists for any nonzero pattern S*<sup>L</sup>* that includes the main diagonal of *A*.

**Theorem 11.2 (Kolotilina & Yeremin 1993)** *Assume A is SPD. If the lower triangular sparsity pattern* S*<sup>L</sup> includes all diagonal positions, then G exists and is unique.*

*Proof* Set $\mathcal{I}_i = \{j \mid (i, j) \in \mathcal{S}_L\}$, and let $A_{\mathcal{I}_i, \mathcal{I}_i}$ denote the submatrix of order $nz_i = |\mathcal{I}_i|$ of entries $a_{kl}$ such that $k, l \in \mathcal{I}_i$. Let $\bar{g}_i$ and $g_i$ be dense vectors containing the nonzero coefficients in row $i$ of $\overline{G}$ and $G$, respectively. Using this notation, solving (11.7) decouples into solving $n$ independent SPD linear systems

$$A_{\mathcal{I}_i, \mathcal{I}_i}\, \bar{g}_i = e_{nz_i}, \quad 1 \le i \le n,$$

where the unit vectors are of length $nz_i$. Moreover,

$$(\overline{G} A \overline{G}^T)_{ii} = \sum_{j \in \mathcal{I}_i} \delta_{i,j} \overline{G}_{ij} = \overline{G}_{ii} = (A_{\mathcal{I}_i, \mathcal{I}_i}^{-1})_{nz_i, nz_i}.$$

Because the inverse of an SPD matrix is SPD, this quantity is positive, which implies that the diagonal entries of $D$ given by (11.8) are well defined and nonzero. Consequently, the computed rows of $G$ exist and provide a unique solution.

The procedure for computing a FSAI preconditioner is summarized in Algorithm 11.2. The computation of each row of *G* can be performed independently; thus, the algorithm is inherently parallel, but each application of the preconditioner requires the solution of triangular systems.

The following theorem states that *G* computed using Algorithm 11.2 is in some sense optimal.

**Theorem 11.3 (Kolotilina et al. 2000)** *Let $L$ be the Cholesky factor of the SPD matrix $A$. Given a lower triangular sparsity pattern $\mathcal{S}_L$ that includes all diagonal positions, let $G$ be the FSAI preconditioner computed using Algorithm 11.2. Then any lower triangular matrix $G_1$ with its sparsity pattern contained in $\mathcal{S}_L$ and $(G_1 A G_1^T)_{ii} = 1$ $(1 \le i \le n)$ satisfies*

$$\|I - GL\|_F \le \|I - G_1 L\|_F.$$

#### **ALGORITHM 11.2 FSAI preconditioner**

**Input:** SPD matrix $A$ and a lower triangular sparsity pattern $\mathcal{S}_L$ that includes all diagonal positions.

**Output:** Lower triangular matrix $G$ such that $A^{-1} \approx G^T G$.

1: **for** $i = 1 : n$ **do**
2: &nbsp;&nbsp;Construct $\mathcal{I}_i = \{j \mid (i, j) \in \mathcal{S}_L\}$, $A_{\mathcal{I}_i,\mathcal{I}_i}$ and set $nz_i = |\mathcal{I}_i|$
3: &nbsp;&nbsp;Solve $A_{\mathcal{I}_i,\mathcal{I}_i}\,\bar{g}_i = e_{nz_i}$
4: &nbsp;&nbsp;Scale $g_i = d_{ii}\,\bar{g}_i$ with $d_{ii} = (\bar{g}_{i,nz_i})^{-1/2}$ ▷ $\bar{g}_{i,nz_i}$ is the last component of $\bar{g}_i$
5: &nbsp;&nbsp;Extend $g_i$ to the row $G_{i,1:i}$ by setting entries that are not in $\mathcal{I}_i$ to zero
6: **end for**

The performance of the FSAI preconditioner is highly dependent on the choice of S*L*. If entries are added to the pattern, then, as the following result shows, the preconditioner is more accurate, but it is also more expensive.

**Theorem 11.4 (Kolotilina et al. 2000)** *Let $L$ be the Cholesky factor of the SPD matrix $A$. Given the lower triangular sparsity patterns $\mathcal{S}_{L_1}$ and $\mathcal{S}_{L_2}$ that include all diagonal positions, let the corresponding FSAI preconditioners computed using Algorithm 11.2 be $G_1$ and $G_2$, respectively. If $\mathcal{S}_{L_1} \subseteq \mathcal{S}_{L_2}$, then*

$$||I - G\_2L||\_F \le ||I - G\_1L||\_F.$$
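Algorithm 11.2 and the monotonicity stated in Theorem 11.4 can be illustrated with a short dense sketch in Python with NumPy. The function name `fsai`, the representation of the pattern as a list of column-index lists (zero-based), and the random SPD test matrix are assumptions made for this illustration; a practical code would work with sparse storage and could process the rows in parallel.

```python
import numpy as np

def fsai(A, pattern):
    """Dense sketch of Algorithm 11.2; pattern[i] lists the column indices of row i."""
    n = A.shape[0]
    G = np.zeros((n, n))
    for i in range(n):
        I_i = sorted(pattern[i])
        assert I_i[-1] == i, "pattern must be lower triangular and contain the diagonal"
        e_last = np.zeros(len(I_i))
        e_last[-1] = 1.0
        g_bar = np.linalg.solve(A[np.ix_(I_i, I_i)], e_last)   # small SPD system
        G[i, I_i] = g_bar / np.sqrt(g_bar[-1])                 # scaling (11.8)
    return G

rng = np.random.default_rng(1)
n = 6
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)              # an SPD test matrix (illustrative)
L = np.linalg.cholesky(A)

patterns = {
    "diagonal": [[i] for i in range(n)],
    "bidiagonal": [[j for j in (i - 1, i) if j >= 0] for i in range(n)],
    "full": [list(range(i + 1)) for i in range(n)],
}
factors = {name: fsai(A, p) for name, p in patterns.items()}
errors = {name: np.linalg.norm(np.eye(n) - G @ L, "fro") for name, G in factors.items()}
```

The three patterns are nested, so the values in `errors` should not increase from the diagonal to the bidiagonal to the full lower triangular pattern; with the full pattern, $G$ coincides with $L^{-1}$ up to rounding.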

#### *11.2.3 FSAI Preconditioner: General Case*

The FSAI algorithm can be extended to a general matrix $A$. Two input sparsity patterns are required: a lower triangular sparsity pattern $\mathcal{S}_L$ and an upper triangular sparsity pattern $\mathcal{S}_U$, both containing the diagonal positions. First, lower and upper triangular matrices $\overline{G}_L$ and $\overline{G}_U$ are computed such that

$$(\overline{G}_L A)_{ij} = \delta_{i,j} \quad \text{for all} \quad (i,j) \in \mathcal{S}_L,$$

$$(A \overline{G}_U)_{ij} = \delta_{i,j} \quad \text{for all} \quad (i,j) \in \mathcal{S}_U.$$

Then $D$ is obtained as the inverse of the diagonal of the matrix $\overline{G}_L A \overline{G}_U$, and the final nonsymmetric FSAI factors are given by $G_L = \overline{G}_L$ and $G_U = \overline{G}_U D$. The computation of the two approximate factors can be performed independently.

This generalization is well defined if, for example, *A* is nonsymmetric positive definite. There is also theory that extends existence to special classes of matrices, including M- and H-matrices. In more general cases, solutions to the reduced systems may not exist, and modifications (such as perturbing the diagonal entries) are needed to circumvent breakdown.

#### *11.2.4 Determining a Good Sparsity Pattern*

The role of the input pattern is to preserve sparsity by filtering out entries of $A^{-1}$ that contribute little to the quality of the preconditioner. For instance, it might be appropriate to ignore entries with a small absolute value, while retaining the largest ones. Unfortunately, the locations of large entries in $A^{-1}$ are generally unknown, and this makes the a priori sparsity choice difficult. A possible exception is when $A$ is a banded SPD matrix. In this case, the entries of $A^{-1}$ are bounded in an exponentially decaying manner along each row or column. Specifically, there exist $0 < \rho < 1$ and a constant $c$ such that for all $i, j$

$$|(A^{-1})_{ij}| \le c\rho^{|i-j|}.$$

The scalars $\rho$ and $c$ depend on the bandwidth and the condition number of $A$. For matrices with a large bandwidth and/or a high condition number, $c$ can be very large and $\rho$ close to one, indicating extremely slow decay. However, if the entries of $A^{-1}$ can be shown to decay rapidly, then a banded $M^{-1}$ should be a good approximation to $A^{-1}$. In this case, $\mathcal{S}_L$ can be chosen to correspond to a matrix with a prescribed bandwidth.
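The decay can be observed on a small well-conditioned example. The sketch below, in Python with NumPy, uses the tridiagonal SPD matrix with 4 on the diagonal and −1 on the off-diagonals; this matrix and the values `rho = 0.27` and `c` (the largest entry of the inverse) are choices made for this illustration and are not taken from the text.

```python
import numpy as np

n = 8
A = 4 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # banded SPD test matrix
A_inv = np.linalg.inv(A)

rows, cols = np.indices((n, n))
dist = np.abs(rows - cols)            # |i - j| for every entry

c = np.abs(A_inv).max()               # illustrative constant
rho = 0.27                            # illustrative decay rate for this matrix
bound_holds = bool(np.all(np.abs(A_inv) <= c * rho ** dist + 1e-12))

# Largest entry of the inverse at each distance from the diagonal.
max_by_dist = np.array([np.abs(A_inv)[dist == d].max() for d in range(n)])
```

For this matrix each step away from the diagonal reduces the entries by roughly a factor of four, so a pattern with a small prescribed bandwidth already captures the entries of significant size.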

A common choice for a general $A$ is $\mathcal{S}_L + \mathcal{S}_U = \mathcal{S}\{A\}$, motivated by the empirical observation that entries in $A^{-1}$ that correspond to nonzero positions in $A$ tend to be relatively large. However, this simple choice is not robust because entries of $A^{-1}$ that lie outside $\mathcal{S}\{A\}$ can also be large. An alternative strategy based on the Neumann series expansion of $A^{-1}$ is to use the pattern of a small power of $A$, i.e., $\mathcal{S}\{A^2\}$ or $\mathcal{S}\{A^3\}$. By starting from the lower and upper triangular parts of $A$, this approach can be used to determine candidates $\mathcal{S}_L$ and $\mathcal{S}_U$. While approximate inverses based on higher powers of $A$ are often better than those corresponding to $A$, there is no guarantee they will result in good preconditioners. Furthermore, even small powers of $A$ can be very dense, thus slowing down the construction and application of the preconditioner. A possible remedy is to use the power of a sparsified $A$.

Alternatively, the pattern can be chosen dynamically by retaining the largest terms in each row of the preconditioner as it is computed, which is what the SPAI algorithm does. Another possibility is to implicitly determine $\mathcal{S}_L + \mathcal{S}_U$ as follows. Starting with a simple sparsity pattern, compute the corresponding FSAI preconditioner $G_1$. Then choose a pattern based on $G_1 A G_1^T$ and apply the FSAI algorithm to $G_1 A G_1^T$ to obtain $G_2$. Finally, set the preconditioner to $G_2 G_1$. Despite running the FSAI algorithm twice, this approach can be worthwhile. Unfortunately, the choice of the best technique for generating a FSAI preconditioner and its sparsity pattern is highly problem dependent.
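The patterns of small powers can be formed structurally, without computing numerical values. The following Python sketch with NumPy uses dense Boolean arrays; the function name `pattern_power` and the tridiagonal test matrix are assumptions for illustration, and numerical cancellation is ignored, so the result is the structural pattern of the power.

```python
import numpy as np

def pattern_power(A, p):
    """Structural sparsity pattern of A**p as a Boolean array (cancellation ignored)."""
    S = (A != 0).astype(int)
    P = np.eye(A.shape[0], dtype=int)
    for _ in range(p):
        P = ((P @ S) > 0).astype(int)
    return P.astype(bool)

n = 6
A = 10 * np.eye(n) - 2 * np.eye(n, k=1) - np.eye(n, k=-1)   # tridiagonal test matrix

S_A = pattern_power(A, 1)             # S{A}: tridiagonal pattern
S_A2 = pattern_power(A, 2)            # S{A^2}: pentadiagonal pattern

# Candidate triangular patterns obtained from powers of the triangular parts of A.
S_L = pattern_power(np.tril(A), 2)
S_U = pattern_power(np.triu(A), 2)
```

Even in this small case the number of entries grows from 16 for $\mathcal{S}\{A\}$ to 24 for $\mathcal{S}\{A^2\}$, which hints at why higher powers of less structured matrices can quickly become dense.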

### **11.3 Factorized Approximate Inverses Based on Incomplete Conjugation**

An alternative way to obtain a factorized approximate inverse is based on incomplete conjugation ($A$-orthogonalization) in the SPD case and on incomplete $A$-biconjugation in the general case. For SPD matrices, the approach represents an approximate Gram–Schmidt orthogonalization that uses the $A$-inner product $\langle \cdot, \cdot \rangle_A$. An important attraction is that the sparsity patterns of the approximate inverse factors need not be specified in advance; instead, they are determined dynamically as the preconditioner is computed.

#### *11.3.1 AINV Preconditioner: SPD Case*

When *A* is an SPD matrix, the AINV preconditioner is defined by an approximate inverse factorization of the form

$$A^{-1} \approx M^{-1} = Z D^{-1} Z^T,$$

where the matrix *Z* is unit upper triangular and *D* is a diagonal matrix with positive entries. The factor *Z* is a sparse approximation of the inverse of the *L<sup>T</sup>* factor in the square root-free factorization of *A*. *Z* and *D* are computed directly from *A* using an incomplete *A*-orthogonalization process applied to the columns of the identity matrix. If entries are not dropped, then a complete factorization of *A*−<sup>1</sup> is computed and *Z* is significantly denser than *L<sup>T</sup>* . To preserve sparsity, at each step of the computation, entries are discarded (for example, using a prescribed threshold, or according to the positions of the entries, or by retaining a chosen number of the largest entries in each column), resulting in an approximate factorization of *A*−1.

There are several variants. Algorithms 11.3 and 11.4 outline left-looking and right-looking approaches, respectively. Practical implementations need to employ sparse matrix techniques. The left-looking scheme computes the $j$-th column $z_j$ of $Z$ as a sparse linear combination of the previous columns $z_1, \ldots, z_{j-1}$. The key is determining which multipliers (the $\alpha$'s in Steps 4 and 5 of the two algorithms, respectively) are nonzero and need to be computed. This can be achieved very efficiently by having access to both the rows and columns of $A$ (although the algorithm does not require that $A$ is explicitly stored; only the capability of forming inner products involving the rows of $A$ is required). For the right-looking approach, the crucial part for each $j$ is the update of the sparse submatrix of $Z$ composed of the columns $j + 1$ to $n$ that are not yet fully computed. Here, only one row of $A$ is used in the outer loop of the algorithm. Therefore, $A$ can be generated on-the-fly by rows. The DS format can be used to store the partially computed $Z$ (Section 1.3.2). As with complete factorizations, the efficiency of the computation and application of AINV preconditioners can benefit from incorporating blocking.

#### **ALGORITHM 11.3 AINV preconditioner (left-looking approach)**

**Input:** SPD matrix $A$ and sparsifying rule.

**Output:** $A^{-1} \approx ZD^{-1}Z^T$ with $Z$ a unit upper triangular matrix and $D$ a diagonal matrix with positive diagonal entries.


### *11.3.2 AINV Preconditioner: General Case*

In the general case, the AINV preconditioner is given by an approximate inverse factorization of the form

$$A^{-1} \approx M^{-1} = WD^{-1}Z^T,$$

where $Z$ and $W$ are unit upper triangular matrices and $D$ is a diagonal matrix. $Z$ and $W$ are sparse approximations of the inverses of the $L^T$ and $U$ factors in the LDU factorization of $A$, respectively. Starting from the columns of the identity matrix, $A$-biconjugation is used to compute the factors. Algorithm 11.5 outlines the right-looking approach. Note that it offers two possibilities for computing the entries $d_{jj}$ of $D$ that are equivalent in exact arithmetic if the factorization is breakdown-free. The left-looking variant given in Algorithm 11.3 can be generalized in a similar way.
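The steps of Algorithm 11.5 translate almost directly into a dense Python sketch with NumPy. The function name `ainv_nonsymmetric` and the sparsifying rule (dropping off-diagonal entries of magnitude below a threshold `drop_tol`) are assumptions made for illustration, and dense arrays stand in for the sparse data structures a practical code would need. With `drop_tol=0.0` nothing is dropped and the factors reproduce $A^{-1}$ up to rounding.

```python
import numpy as np

def ainv_nonsymmetric(A, drop_tol=0.0):
    """Dense sketch of right-looking nonsymmetric AINV; returns (W, d, Z)."""
    n = A.shape[0]
    Z = np.eye(n)
    W = np.eye(n)
    d = np.zeros(n)
    for j in range(n):
        d[j] = A[:, j] @ Z[:, j]            # column j of A with z_j
        for k in range(j + 1, n):
            alpha = (A[:, j] @ Z[:, k]) / d[j]
            Z[:, k] -= alpha * Z[:, j]
            beta = (A[j, :] @ W[:, k]) / d[j]
            W[:, k] -= beta * W[:, j]
            for M in (Z, W):                # assumed sparsifying rule: threshold dropping
                small = np.abs(M[:, k]) < drop_tol
                small[k] = False            # keep the unit diagonal entry
                M[small, k] = 0.0
    return W, d, Z

# Nonsymmetric, diagonally dominant test matrix (the matrix of Figure 11.1).
A = 10 * np.eye(5) - 2 * np.eye(5, k=1) - np.eye(5, k=-1)

W, d, Z = ainv_nonsymmetric(A)                        # no dropping
M_inv = W @ np.diag(1.0 / d) @ Z.T

Ws, ds, Zs = ainv_nonsymmetric(A, drop_tol=0.05)      # with dropping
nnz_exact = np.count_nonzero(W) + np.count_nonzero(Z)
nnz_sparse = np.count_nonzero(Ws) + np.count_nonzero(Zs)
```

Comparing `nnz_sparse` with `nnz_exact` shows how much fill the threshold removes for a given accuracy.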

Figure 11.2 illustrates the sparsity patterns of the AINV factors for a matrix arising in circuit simulation. $\mathcal{S}\{A\}$ is symmetric, but the values of the entries of $A$ are nonsymmetric. The sparsity pattern $\mathcal{S}\{W + Z^T\}$ is given, where $W$ and $Z$ are computed using Algorithm 11.5 with sparsification based on a dropping tolerance of 0.5. Also given are the patterns $\mathcal{S}\{\widetilde{L} + \widetilde{U}\}$ and $\mathcal{S}\{\widetilde{L}^{-1} + \widetilde{U}^{-1}\}$ for the incomplete factors $\widetilde{L}$ and $\widetilde{U}$ computed using Algorithm 10.2 (see Section 10.2) with a dropping tolerance of 0.1 and at most 10 entries in each row of $\widetilde{L} + \widetilde{U}$. Note that this dual dropping strategy is one of the most popular ways of employing Algorithm 10.2; it is often denoted as ILUT($p, \tau$), where $p$ is the maximum number of entries allowed in each row and $\tau$ is the dropping tolerance. In this example, the parameters were chosen so that the numbers of entries in $W + Z^T$ and $\widetilde{L} + \widetilde{U}$ are approximately equal, but the resulting sparsity patterns are clearly different. In particular, potentially important information is lost from $\mathcal{S}\{\widetilde{L}^{-1} + \widetilde{U}^{-1}\}$.

**Figure 11.2** An example to illustrate the difference between the sparsity patterns of the AINV factors and those of the inverse of the ILU factors. The sparsity pattern $\mathcal{S}\{A\}$ of the matrix $A$ is given (top left) together with the patterns of the factorized approximate inverse factors $\mathcal{S}\{W + Z^T\}$ (top right), the ILU factors $\mathcal{S}\{\widetilde{L} + \widetilde{U}\}$ (bottom left), and their inverses $\mathcal{S}\{\widetilde{L}^{-1} + \widetilde{U}^{-1}\}$ (bottom right).

#### *11.3.3 SAINV: Stabilization of the AINV Method*

The following result is analogous to Theorem 9.4.

**Theorem 11.5 (Benzi et al. 1996)** *If A is a nonsingular M- or H-matrix, then the AINV factorization of A does not break down.*

For more general matrices, breakdown can happen because of the occurrence of a zero *djj* or, in the SPD case, negative *djj* . In practice, exact zeros are unlikely but very small *djj* can occur (near breakdown), which may lead to uncontrolled growth in the size of entries in the incomplete factors and, because such entries are not dropped when using a threshold parameter, a large amount of fill-in. The next theorem indicates how breakdown can be prevented when *A* is SPD through reformulating the *A*-orthogonalization.



#### **ALGORITHM 11.4 AINV preconditioner (right-looking approach)**

**Input:** SPD matrix $A$ and sparsifying rule.

**Output:** $A^{-1} \approx ZD^{-1}Z^T$ with $Z$ a unit upper triangular matrix and $D$ a diagonal matrix with positive diagonal entries.


**Theorem 11.6 (Benzi et al. 2000; Kopal et al. 2012)** *Consider Algorithm 11.4 with no sparsification (Step 7 is removed). The following identity holds*

$$A\_{j,1:n}z\_k^{(j-1)} \equiv e\_j^T A z\_k^{(j-1)} = \langle z\_j^{(j-1)}, z\_k^{(j-1)} \rangle\_A, \quad 1 \le j \le k \le n.$$

*Proof* Because $AZ = Z^{-T}D$ and $Z^{-T}D$ is lower triangular, entries 1 to $j-1$ of the vector $Az_k^{(j-1)}$ are equal to zero. $Z$ is unit upper triangular so entries $j+1$ to $n$ of its $j$-th column $z_j^{(j-1)}$ are also equal to zero. Thus, $z_j^{(j-1)}$ can be written as the sum $z + e_j$, where entries $j$ to $n$ of the vector $z$ are zero. The result follows because $z^T A z_k^{(j-1)} = 0$.

This suggests using alternative computations within the AINV approach based on the whole of $A$ instead of on its rows. The reformulation, which is called the stabilized AINV algorithm (SAINV), is outlined in Algorithm 11.6. It is breakdown-free for any SPD matrix $A$ because the diagonal entries are $d_{jj} = \langle z_j^{(j-1)}, z_j^{(j-1)} \rangle_A > 0$. Practical experience shows that, while slightly more costly to compute, the SAINV algorithm gives higher quality preconditioners than the AINV algorithm. However, the computed diagonal entries can still be very small and may need to be modified.

#### **ALGORITHM 11.5 Nonsymmetric AINV preconditioner (right-looking approach)**

**Input:** Nonsymmetric matrix $A$ and sparsifying rule.

**Output:** $A^{-1} \approx WD^{-1}Z^T$ with $W$ and $Z$ unit upper triangular matrices and $D$ a diagonal matrix.

1: $[z_1^{(0)}, \ldots, z_n^{(0)}] = [e_1, \ldots, e_n]$ and $[w_1^{(0)}, \ldots, w_n^{(0)}] = [e_1, \ldots, e_n]$
2: **for** $j = 1 : n$ **do**
3: &nbsp;&nbsp;$d_{jj} = (A_{1:n,j})^T z_j^{(j-1)}$ or $d_{jj} = A_{j,1:n}\, w_j^{(j-1)}$
4: &nbsp;&nbsp;**for** $k = j + 1 : n$ **do**
5: &nbsp;&nbsp;&nbsp;&nbsp;$\alpha = (A_{1:n,j})^T z_k^{(j-1)} / d_{jj}$
6: &nbsp;&nbsp;&nbsp;&nbsp;$z_k^{(j)} = z_k^{(j-1)} - \alpha z_j^{(j-1)}$
7: &nbsp;&nbsp;&nbsp;&nbsp;Sparsify $z_k^{(j)}$ ▷ Drop entries from $z_k^{(j)}$
8: &nbsp;&nbsp;&nbsp;&nbsp;$\beta = A_{j,1:n}\, w_k^{(j-1)} / d_{jj}$
9: &nbsp;&nbsp;&nbsp;&nbsp;$w_k^{(j)} = w_k^{(j-1)} - \beta w_j^{(j-1)}$
10: &nbsp;&nbsp;&nbsp;&nbsp;Sparsify $w_k^{(j)}$ ▷ Drop entries from $w_k^{(j)}$
11: &nbsp;&nbsp;**end for**
12: **end for**
13: $Z = [z_1^{(0)}, \ldots, z_n^{(n-1)}]$ and $W = [w_1^{(0)}, \ldots, w_n^{(n-1)}]$

The factors $Z$ and $D$ obtained with no sparsification can be used to compute the square root-free Cholesky factorization of $A$. The $L$ factor of $A$ and the inverse factor $Z$ computed using Algorithm 11.6 without sparsification satisfy

$$AZ = LD \quad \text{or} \quad L = AZD^{-1}.$$

Using $d_{jj} = \langle z_j^{(j-1)}, z_j^{(j-1)} \rangle_A$, and equating corresponding entries of $AZD^{-1}$ and $L$, gives

$$l_{ij} = \frac{\langle z_j^{(j-1)}, z_i^{(j-1)} \rangle_A}{\langle z_j^{(j-1)}, z_j^{(j-1)} \rangle_A}, \quad 1 \le j \le i \le n.$$

Thus, the SAINV algorithm generates the $L$ factor of the square root-free Cholesky factorization of $A$ as a by-product of orthogonalization in the inner product $\langle \cdot, \cdot \rangle_A$ at no extra cost and without breakdown.
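The relation between the SAINV factors and the square root-free Cholesky factorization can be checked with a small dense sketch in Python with NumPy. The listing is an assumption-laden reading of the description above, not a transcription of Algorithm 11.6: the function name `sainv`, the threshold-based sparsifying rule, and the test matrix are all illustrative, and the multipliers are formed as $A$-inner products $\langle z_j, z_k \rangle_A / \langle z_j, z_j \rangle_A$.

```python
import numpy as np

def sainv(A, drop_tol=0.0):
    """Dense sketch of right-looking SAINV for SPD A; returns (Z, d)."""
    n = A.shape[0]
    Z = np.eye(n)
    d = np.zeros(n)
    for j in range(n):
        v = A @ Z[:, j]
        d[j] = Z[:, j] @ v                  # <z_j, z_j>_A, positive for SPD A
        for k in range(j + 1, n):
            alpha = (v @ Z[:, k]) / d[j]    # <z_j, z_k>_A / d_jj
            Z[:, k] -= alpha * Z[:, j]
            small = np.abs(Z[:, k]) < drop_tol   # assumed sparsifying rule
            small[k] = False                     # keep the unit diagonal entry
            Z[small, k] = 0.0
    return Z, d

rng = np.random.default_rng(2)
n = 6
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)                 # SPD test matrix (illustrative)

Z, d = sainv(A)                             # no sparsification
L = (A @ Z) / d                             # L = A Z D^{-1}, column j scaled by 1/d_jj

Zs, ds = sainv(A, drop_tol=0.05)            # with dropping, d_jj stays positive
```

Without dropping, `L` is unit lower triangular and $L D L^T$ recovers $A$; with dropping, every `ds[j]` remains positive because each column of `Zs` keeps its unit diagonal entry and is therefore nonzero.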

#### **ALGORITHM 11.6 SAINV preconditioner (right-looking approach)**

**Input:** SPD matrix $A$ and sparsifying rule.

**Output:** $A^{-1} \approx ZD^{-1}Z^T$ with $Z$ a unit upper triangular matrix and $D$ a diagonal matrix with positive diagonal entries.

The stabilization strategy can be extended to the nonsymmetric AINV algorithm using the following result.

**Theorem 11.7 (Benzi & Tůma 1998; Bollhöfer & Saad 2002)** *Consider Algorithm 11.5 with no sparsification (Steps 7 and 10 removed). The following identities hold:*

$$A_{j,1:n}z_k^{(j-1)} = e_j^T A z_k^{(j-1)} = \langle w_j^{(j-1)}, z_k^{(j-1)} \rangle_A,$$

$$(A_{1:n,j})^T w_k^{(j-1)} = e_j^T A^T w_k^{(j-1)} = \langle z_j^{(j-1)}, w_k^{(j-1)} \rangle_A, \quad 1 \le j \le k \le n.$$

The nonsymmetric SAINV algorithm obtained using this reformulation can improve the preconditioner quality, but it is not guaranteed to be breakdown-free.

#### **11.4 Notes and References**

Benzi & Tůma (1999) present an early comparative study that puts preconditioning by approximate inverses into the context of alternative preconditioning techniques; see also Bollhöfer & Saad (2002, 2006), Benzi & Tůma (2003), and Bru et al. (2008, 2010). The inverse by bordering method mentioned in Section 11.1 is from Saad (2003b).

The first use of approximate inverses based on Frobenius norm minimization is given by Benson (1973). A SPAI approach that can exploit a dynamically changing sparsity pattern S is introduced in Cosgrove et al. (1992); an independent and enhanced description is given in the influential paper by Grote & Huckle (1997). Later developments are presented in Holland et al. (2005), Jia & Zhang (2013), and Jia & Kang (2019). A comprehensive discussion on the choice of the sparsity pattern S can be found in Huckle (1999). Huckle & Kallischko (2007) consider modifying the SPAI method by probing or symmetrizing the approximate inverse and Bröker et al. (2001) look at using approximate inverses based on Frobenius norm minimization as smoothers for multigrid methods. Choosing sparsity patterns for a related approximate inverse with a particular emphasis on parallel computing is described by Chow (2000).

For nonsymmetric matrices, MI12 within the HSL mathematical software library computes SPAI preconditioners (see Gould & Scott, 1998 for details and a discussion of the merits and limitations of the approach). An early parallel implementation is given by Barnard et al. (1999). Dehnavi et al. (2013) present an efficient parallel implementation that uses GPUs and include comparisons with ParaSails (Chow, 2001). The latter handles SPD problems using a factored sparse approximate inverse and general problems with an unfactored sparse approximate inverse. A priori techniques determine S as a power of a sparsified matrix.

Original work on the FSAI preconditioner is by Kolotilina & Yeremin (1986, 1993). Its use in solving systems on massively parallel computers is presented in Kolotilina et al. (1992), while an interesting iterative construction can be found in Kolotilina et al. (2000). A parallel variant called ISAI preconditioning that combines a Frobenius norm-based approach with traditional ILU preconditioning is proposed by Anzt et al. (2018). FSAI preconditioning has attracted significant theoretical and practical attention. Recent contributions discuss not only its efficacy but also parallel computation, the use of blocks, supernodes, and multilevel implementations (Ferronato et al., 2012, 2014; Janna & Ferronato, 2011; Janna et al., 2010, 2013, 2015; Ferronato & Pini, 2018; Magri et al., 2018). Many of these enhancements are exploited in the FSAIPACK software of Janna et al. (2015).

The AINV preconditioner for SPD and nonsymmetric systems is introduced in Benzi et al. (1996) and Benzi & Tůma (1998), respectively; see also Benzi et al. (1999) for a parallel implementation. However, the development of this type of preconditioner follows much earlier interest in factorized matrix inverses (for example, Morris, 1946 and Fox et al., 1948). For the SAINV algorithm, see Benzi et al. (2000) and Kharchenko et al. (2001). Theoretical and practical properties of the AINV and SAINV factorizations are studied in a series of papers by Kopal et al. (2012, 2016, 2020).

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

## **References**


J. Scott, M. Tůma, *Algorithms for Sparse Linear Systems*, Nečas Center Series, https://doi.org/10.1007/978-3-031-25820-6


Xu, J. & Zikatanov, L. (2017). Algebraic multigrid methods. *Acta Numerica, 26*, 591–721.


## **Index**

#### **A**

Active submatrix, 33 Acyclic graph, 21 Adjacency graph, 24 Adjacency set, 20 Algebraic preconditioner, 3, 164, 167 Alternating path, 104 Analyse phase, 6 Approximate inverse preconditioner, 205 Approximate minimum degree, 144 Assembly tree, 67, 79, 84, 120 Augmenting path, 104

#### **B**

Backward error, 113 Bandwidth, 145 Basic linear algebra subroutines (BLAS), 8, 69, 77, 123 Bipartite graph, 103 Bit compatibility, 10, 86 Block pivoting, 117 Block triangular form, 43, 108 Bordered form doubly bordered, 155 singly bordered, 157 Breadth-first search, 27, 148, 189

#### **C**

CCS format, 14 Cholesky factorization, 2, 5, 53, 73 incomplete, 197 Cholesky symbolic factorization, 53 Column replication principle, 54, 89 Complete pivoting, 116 Complexity, 10 Condition number, 113, 127 Coordinate format, 13 CSR format, 13 Current degree, 137 Cuthill McKee ordering, 146

#### **D**

Degree, 20 current, 137 outmatching, 140 Delayed pivots, 120 Depth-first search, 27, 44, 64, 78, 108 Diagonally dominant matrix, 172, 193 Digraph, 19 Directed acyclic graph (DAG), 22, 89, 92, 95 task DAG, 76, 80 Directed graph, 19 DS format, 15, 152, 215 Dual variables, 107, 131 Dulmage-Mendelsohn decomposition, 108

#### **E**

Eisenstat trick, 170 Elimination matrix, 32 Elimination tree, 55 column, 98 nonsymmetric, 97 Envelope, 145 Extend-add, 83, 102 External degree, 139


#### **F**

- Factorizable matrix, 6
- Factorization
  - bordering, 37, 95
  - breakdown, 173
  - Cholesky, 5, 73
  - generic form, 33
  - incomplete, 164, 172
  - incomplete Crout, 187
  - LDU, 5
  - left-looking, 36
  - LU, 5
  - multifrontal, 81, 100
  - right-looking, 35
  - square root-free Cholesky, 5, 33
  - up-looking, 77
  - variants, 34
- Fiedler vector, 150
- Filled graph, 31
- Fill-in, 31
- Frobenius normal form, 43
- Frobenius norm minimization, 207
- Frontal matrix, 82, 100
- Frontal method, 82, 87
- Fundamental supernode, 70

#### **G**

- Gaussian elimination, 5, 31, 89
- Generated element, 82
- GPS algorithm, 148
- Graph
  - acyclic, 21
  - adjacency, 24
  - ancestor, 22
  - bipartite, 103, 125, 200
  - child, 23
  - clique, 20, 38, 141
  - column elimination tree, 98
  - condensation, 44
  - connected, 23
  - DAG, 22
  - degree, 20
  - descendant, 22
  - diameter, 147
  - digraph, 19
  - directed, 19
  - eccentricity, 147
  - elimination, 33
  - elimination tree, 55
  - equireachable, 93
  - filled, 31
  - fill-path, 21
  - forest, 23, 55
  - incident edge, 20
  - independent set, 103, 199
  - induced subgraph, 19
  - isomorphic, 20
  - leaf vertex, 23
  - level, 27
  - level sets, 148
  - mass elimination, 139
  - maximal clique, 67
  - maximum matching, 103
  - neighbours, 20
  - nonsymmetric elimination tree, 97
  - parent, 23
  - path, 21
  - path compression, 59
  - peripheral vertices, 147
  - postordering, 28, 64
  - preordering, 28
  - pruned subtree, 56
  - pruning, 95
  - pseudo-diameter, 147
  - pseudo-peripheral vertices, 147
  - quotient, 44, 141
  - reachability, 21
  - reachable set, 22, 39
  - rooted tree, 23
  - root vertex, 23, 55
  - row subtree, 57, 79
  - search, 27
  - sibling, 23
  - skeleton, 62
  - spanning tree, 23
  - strongly connected, 23
  - strongly connected components, 23, 44
  - subgraph, 19
  - supervariable, 47
  - symmetric pruning, 96
  - topological ordering, 26
  - transitive reduction, 92
  - traversal, 27, 190
  - tree, 23
  - undirected, 19
  - virtual tree, 59
  - walk, 21
  - weighted, 24
- Growth factor, 115

#### **H**

- H-matrix, 172, 173, 217
- Hybrid solver, 179
- Hypergraphs, 161


#### **I**

- Ill-conditioning, 113, 126
- Incomplete factorization, 164, 172, 185
  - Crout variant, 187
  - dynamic compensation, 193
  - fixed-point ILU, 197
  - IC($\ell$), 188
  - ILU($\ell$), 188
  - level-based, 188
  - memory-limited, 194
  - modified (MILU), 190
  - row variant, 187
- Indistinguishable vertices, 46, 138
- Irreducible matrix, 42
- Iterative methods
  - Krylov subspace, 164
  - stationary, 164
- Iterative refinement, 128

#### **K**

- Krylov subspace methods, 166

#### **L**

- Level sets, 148
- List, 26
  - linked list, 12, 14
  - queue, 27
  - stack, 27
- LU factorization, 5
  - column, 35
  - incomplete, 185

#### **M**

- Markowitz pivoting, 151
- Matching, 103
  - extreme, 107
  - perfect, 103
- Matrix
  - block triangular form, 43
  - column elimination, 32
  - dense, 5
  - factorizable, 6
  - inertia, 123
  - irreducible, 42
  - permutation, 25
  - reducible, 42
  - saddle point, 4
  - skeleton, 62
  - sparse, 1, 5
  - sparsity pattern, 5
  - strong Hall, 42, 99, 108
  - strongly regular, 6
  - structural singularity, 5
  - symmetric indefinite, 4
  - symmetric positive definite, 4
- Maximum matching, 103
- Minimum degree algorithm, 137
- M-matrix, 171, 173, 217
- Multifrontal method, 81, 100, 120
- Multiple minimum degree algorithm, 143

#### **N**

- Nested dissection ordering, 152
- Non-cancellation assumption, 31

#### **O**

- Ordering
  - approximate minimum degree, 144
  - Cuthill McKee, 146
  - global, 135
  - level-based, 146
  - local, 135
  - Markowitz, 151
  - minimum deficiency, 136
  - minimum degree, 137
  - minimum discarded fill, 176
  - minimum fill-in, 136
  - multiple minimum degree, 143
  - nested dissection, 152, 199, 206
  - postordering, 28, 64
  - preordering, 28
  - red-black, 199
  - Reverse Cuthill McKee, 146, 199
  - sparse matrix, 135
  - spectral method, 150
  - topological, 26, 64

#### **P**

- Parter's rule, 37
- Partial pivoting, 36, 99, 114, 116
- Path, 21, 39
  - alternating, 104
  - augmenting, 104
- Permutation matrix, 25
- Permutation vector, 25
- Pivoting
  - 2 × 2, 119
  - blocks, 117
  - complete, 116
  - incomplete factorization, 175
  - partial, 36, 99, 114, 116
  - relaxed, 123
  - rook, 117, 119
  - sparse indefinite, 119
  - static, 123
  - threshold, 118
- Pivots, 33
  - delayed, 120
- Preconditioner
  - AINV, 215
  - algebraic, 167
  - approximate inverse, 205
  - deflation, 180
  - domain decomposition, 181
  - FSAI, 211
  - incomplete factorization, 172, 185
  - Jacobi, 169
  - left, 167
  - polynomial, 176
  - right, 167
  - SAINV, 217
  - Schur complement, 178
  - SPAI, 207
  - SSOR, 169
  - two-sided, 167
- Profile, 145

#### **Q**

- Queue, 27, 190
- Quotient graph, 44, 141
  - condensation, 44

#### **R**

- Reducible matrix, 42
- Relaxed pivoting, 123
- Reverse Cuthill McKee ordering, 146, 199
- Rook pivoting, 117, 119
- Row replication principle, 89

#### **S**

- Scaling, 129
  - equilibration, 130
  - matching-based, 130
- Schur complement, 34, 81, 89, 178, 194, 200
- Skeleton graph, 62
- Skeleton matrix, 62
- Spectral condition number, 128, 166
- Spectral radius, 165
- Static pivoting, 123
- Storage
  - CCS, 14
  - coordinate format, 13
  - CSR, 13
  - DS, 15, 152, 215
  - linked list, 12, 14
  - VBR, 16
- Strongly connected components, 23, 44
- Strongly regular matrix, 6
- Supernode, 67, 76, 201
  - amalgamation, 69
  - fundamental, 70
  - LU factorization, 100
  - relaxed, 69
- Supervariable, 47, 76, 138
- Symbolic factorization, 6
  - Cholesky, 53
- Symmetric pruning, 96
- Symmetry index, 4

#### **T**

- Threshold pivoting, 118
- Transitive reduction, 92
- Transversal, 43
- Tree, 23
  - assembly, 67
  - elimination, 55, 97, 98
  - leaf vertex, 23
  - root, 23
  - row subtree, 57, 79
  - virtual tree, 59

#### **U**

- Undirected graph, 19
- Update matrix, 82

#### **V**

- VBR format, 16
- Vector
  - permutation, 25
  - sparse, 6
- Vertex labelling, 19
- Vertex separator, 152